Project

General

Profile

Actions

Feature #5106

open

Watchdog that would try to un-BORKE BORKen timeslots

Added by keith almost 3 years ago. Updated about 1 year ago.

Status:
Feedback
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
04/04/2021
Due date:
% Done:

60%

Spec Reference:

Description

In #5096 that discussed lchans ending up borken on a busy ericsson bts:

"What would probably make sense is some kind of watchdog that would try to un-BORKE BORKen timeslots after a certain time. This can be done by activating a broken sub-slot and releasing it immediately. If the BTS still refuses to activate it, then it's completely BORKen and the second attempt to un-BORKE can be postponed further."

(https://osmocom.org/issues/5096#note-12)


Files

e1_lost.pcapng e1_lost.pcapng 2.28 MB keith, 11/23/2022 09:28 PM

Checklist

  • add test
  • adjust lchan-fsm.dot
  • adjust lchan.msc
  • print count of recover attempts in vty code?

Related issues

Related to OsmoBSC - Bug #5096: rbs2000 ALL BORKEN ChannelsResolvedkeith03/26/202104/04/2021

Actions
Related to OsmoBSC - Feature #5785: enable either TCP or IPA keep-alive by defaultResolvedlaforge11/23/2022

Actions
Related to OsmoBSC - Bug #5786: no error message if keepalive options not supportedResolvedosmith11/23/2022

Actions
Related to OsmoBSC - Feature #4537: OsmoBSC needs strategies to recover broken lchans (lchan state BORKEN)In Progress05/08/2020

Actions
Actions #1

Updated by keith almost 3 years ago

  • Related to Bug #5096: rbs2000 ALL BORKEN Channels added
Actions #2

Updated by neels almost 3 years ago

It would make most sense to introduce a timeout on an lchan entering the BORKEN state, to retry to resurrect the lchan by some standard actions: release the lchan, activate then release, things like that. This could be limited to say three times before entering the BORKEN state "forever".

Also we could consider to not enter the BORKEN state on a chan act NACK, but simply regard the lchan as not in use again. It seems that the reported case (I heard) included some invalid request in the chan act, and hence the NACK would simply mean to go back to UNUSED.

Actions #3

Updated by keith over 1 year ago

  • Priority changed from Low to Urgent

Bumping this, as I am seeing this quite a lot since the introduction of TCH/F_TCH/H_SDCCH8_PDCH

OsmoBSC# show lchan summary 2
BTS 2, TRX 0, Timeslot 2 TCH/H, Lchan 0, Type TCH_H TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -83 dBm RXL-FULL-ul:  -93 dBm
BTS 2, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as TCH/F, Lchan 0, Type TCH_F TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -93 dBm RXL-FULL-ul:  -86 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 5, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 6, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 7, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 5, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 6, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 7, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm

Actions #4

Updated by osmith over 1 year ago

  • Status changed from New to In Progress
  • Assignee changed from neels to osmith
Actions #5

Updated by osmith over 1 year ago

I'm working on a patch to implement the retry mechanism.

keith wrote in #note-3:

Bumping this, as I am seeing this quite a lot since the introduction of TCH/F_TCH/H_SDCCH8_PDCH

It would be good to understand why this happens though, can you provide logs that show why the BORKEN state is entered? (all state changes to TS_ST_BORKEN go through ts_fsm_error() where a LOGL_ERROR message should be written.)

Actions #6

Updated by osmith over 1 year ago

(keith: added you to watchers of the issue, not sure if you received notifications before)

Actions #7

Updated by neels over 1 year ago

osmith, btw, when you enhance the lchan FSM, it would be great if you could also adjust osmo-bsc/doc/lchan-fsm.dot and lchan.msc accordingly, so the charts reflect the implementation.

Actions #8

Updated by keith over 1 year ago

This looks weird, I will try to get something better and the Abis with NACKs

Nov 10 14:51:49 nuyoo osmo-bsc[10707]: DCHAN lchan_fsm.c:911 lchan(1-0-4-OSMO_DYN-7)[0x563d32ce6960]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Chan Activ NACK: Service or Option not implemented (0x4f)
Nov 10 14:51:49 nuyoo osmo-bsc[10707]: DCHAN lchan_fsm.c:141 lchan(1-0-4-OSMO_DYN-7)[0x563d32ce6960]{WAIT_ACTIV_ACK}: (type=SDCCH) Tx Immediate Assignment Reject (lchan allocation failed in state WAIT_ACTIV_ACK: Chan Activ NACK: Service or Option not implemented (0x4f))
Nov 10 14:57:42 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 14:58:24 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 14:59:08 nuyoo osmo-bsc[10707]: DCHAN lchan_fsm.c:1827 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) lchan failure in state WAIT_RF_RELEASE_ACK: Timeout
Nov 10 14:59:13 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{BORKEN}: (type=NONE) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 14:59:33 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 15:01:28 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 15:01:41 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32e3c740]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 15:04:16 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32cd86b0]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Nov 10 15:04:55 nuyoo osmo-bsc[10707]: DCHAN abis_rsl.c:2969 lchan(1-0-4-OSMO_DYN-0)[0x563d32cd86b0]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])

Actions #9

Updated by keith over 1 year ago

Just taking a look at logging level set-all error, after resetting all lchans that were broken: then after a short while:

OsmoBSC# show lchan summary
BTS 0, TRX 0, Timeslot 1 SDCCH8, Lchan 0, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -92 dBm RXL-FULL-ul:  -99 dBm
BTS 0, TRX 0, Timeslot 1 SDCCH8, Lchan 2, Type SDCCH TSC-s1c7, State WAIT_BEFORE_RF_RELEASE - L1 MS Power: 33 dBm RXL-FULL-dl:  -67 dBm RXL-FULL-ul:  -74 dBm
BTS 0, TRX 0, Timeslot 1 SDCCH8, Lchan 3, Type SDCCH TSC-s1c7, State WAIT_BEFORE_RF_RELEASE - L1 MS Power: 33 dBm RXL-FULL-dl:  -96 dBm RXL-FULL-ul:  -98 dBm
BTS 0, TRX 0, Timeslot 1 SDCCH8, Lchan 4, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl: -100 dBm RXL-FULL-ul: -101 dBm
BTS 0, TRX 0, Timeslot 1 SDCCH8, Lchan 5, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -105 dBm
BTS 0, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 1, TRX 0, Timeslot 1 SDCCH8, Lchan 0, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -92 dBm RXL-FULL-ul: -106 dBm
BTS 1, TRX 0, Timeslot 1 SDCCH8, Lchan 1, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -81 dBm RXL-FULL-ul:  -93 dBm
BTS 1, TRX 0, Timeslot 1 SDCCH8, Lchan 2, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -83 dBm RXL-FULL-ul:  -87 dBm
BTS 1, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 1, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 1, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 2, TRX 0, Timeslot 1 SDCCH8, Lchan 1, Type SDCCH TSC-s1c7, State WAIT_BEFORE_RF_RELEASE - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -109 dBm
BTS 2, TRX 0, Timeslot 1 SDCCH8, Lchan 2, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -81 dBm RXL-FULL-ul:  -97 dBm
BTS 2, TRX 0, Timeslot 1 SDCCH8, Lchan 5, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -83 dBm RXL-FULL-ul:  -86 dBm
BTS 3, TRX 0, Timeslot 1 SDCCH8, Lchan 1, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -47 dBm RXL-FULL-ul:  -87 dBm
BTS 3, TRX 0, Timeslot 1 SDCCH8, Lchan 4, Type SDCCH TSC-s1c7, State WAIT_BEFORE_RF_RELEASE - L1 MS Power: 33 dBm RXL-FULL-dl:  -77 dBm RXL-FULL-ul:  -81 dBm
BTS 3, TRX 0, Timeslot 1 SDCCH8, Lchan 5, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -96 dBm RXL-FULL-ul:  -83 dBm
BTS 3, TRX 0, Timeslot 1 SDCCH8, Lchan 6, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -99 dBm RXL-FULL-ul: -102 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 4, TRX 0, Timeslot 1 SDCCH8, Lchan 0, Type SDCCH TSC-s1c7, State WAIT_BEFORE_RF_RELEASE - L1 MS Power: 33 dBm RXL-FULL-dl:  -85 dBm RXL-FULL-ul:  -79 dBm
BTS 4, TRX 0, Timeslot 1 SDCCH8, Lchan 1, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl: -108 dBm RXL-FULL-ul: -101 dBm
BTS 4, TRX 0, Timeslot 1 SDCCH8, Lchan 2, Type SDCCH TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -91 dBm RXL-FULL-ul:  -92 dBm
OsmoBSC#

We have the corresponding log entries for the now BORKEN channels.

Nov 18 14:09:45 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(3-0-6-OSMO_DYN-0)[0x564c3186f920]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:09:45 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(3-0-6-OSMO_DYN-3)[0x564c31872840]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:09:45 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(3-0-6-OSMO_DYN-2)[0x564c31872600]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:09:45 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(3-0-6-OSMO_DYN-1)[0x564c3186fad0]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:30:34 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(1-0-4-OSMO_DYN-0)[0x564c31868e40]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:30:34 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(1-0-4-OSMO_DYN-2)[0x564c3186b860]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:30:34 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(1-0-4-OSMO_DYN-1)[0x564c31868ff0]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:36:25 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(0-0-4-OSMO_DYN-0)[0x564c31839660]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:36:25 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(0-0-4-OSMO_DYN-2)[0x564c31839a50]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:36:25 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(0-0-4-OSMO_DYN-1)[0x564c31839810]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout
Nov 18 14:36:25 nuyoo osmo-bsc[1436]: DCHAN ERROR lchan_fsm.c:1827 lchan(0-0-4-OSMO_DYN-3)[0x564c3183b690]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout

So I would think simply loss of connectivity on the Abis link?

Actions #10

Updated by neels over 1 year ago

On Fri, Nov 18, 2022 at 08:59:34PM +0000, keith wrote:

Nov 18 14:36:25 nuyoo osmo-bsc1436: DCHAN ERROR lchan_fsm.c:1827 lchan(0-0-4-OSMO_DYN-3)[0x564c3183b690]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout

So I would think simply loss of connectivity on the Abis link?

Looks like it.

I assume extending the 5 seconds OsmoBSC waits for an ACTIV ACK by default does
not make sense (5 seconds is very long already). Otherwise that could be done
by the 'net X6' timer in osmo-bsc.cfg.

osmith, just sharing the tought, in case some of the Channel Activ still got
through, but the Activ ACK did not come back to the BSC, it may mean that an
lchan is currently active from the BTS POV, but BSC sees it as borken.
Recovering from that borken state would have to be ok with it if another
Channel Activ attempt results in a NACK message; a subsequent RF Channel Rel
may then still succeed.

BTS       BSC
<--Activ--
---ACK->xx          (ACK lost due to link loss)
<timeout>
       borken.
       recovery:
<--Activ--
---NACK-->          (because BTS sees channel as already active)
do not despair
<--Rel----
---ACK--->
yay unborken
Actions #11

Updated by laforge over 1 year ago

On Mon, Nov 21, 2022 at 12:47:26AM +0000, neels wrote:

osmith, just sharing the tought, in case some of the Channel Activ still got
through, but the Activ ACK did not come back to the BSC, it may mean that an
lchan is currently active from the BTS POV, but BSC sees it as borken.
Recovering from that borken state would have to be ok with it if another
Channel Activ attempt results in a NACK message; a subsequent RF Channel Rel
may then still succeed.

without knowing all the details/background here, this kind of problem would then only
happen on Ericsson (or possibly other E1) BTS, as in osmo-bts (AFAIR) we drop all state
(channel allocations, etc) when RSL goes down, right?

But then getting back to the E1 BTS situation: As far as I understand in your deployments
the E1 link is something like 1-2m of copper wire between the icE1usb
(at the BSC) and the BTS. In that situation there should never be any
link loss (LAPD/L2 or even E1 alignment/L1). If there is, it points to
a bug in our LAPD code, osmo-e1d, icE1usb or the wiring/connections.

How often do you see the RSL LAPD link go down?

Actions #12

Updated by osmith over 1 year ago

  • Checklist item add test added
  • Checklist item adjust lchan-fsm.dot added
  • Checklist item adjust lchan.msc added
  • Checklist item print count of recover attempts in vty code? added
  • % Done changed from 0 to 30

Analyzing why exactly it fails for you (see messages above) sounds like the proper way towards fixing this.

However given that this is set to urgent, here's a WIP patch to add the recover from borken mechanism. It implements the logic but doesn't have a test yet and so I haven't verified if this works as expected:
https://gerrit.osmocom.org/c/osmo-bsc/+/30252

neels wrote in #note-10:

osmith, just sharing the tought, in case some of the Channel Activ still got
through, but the Activ ACK did not come back to the BSC, it may mean that an
lchan is currently active from the BTS POV, but BSC sees it as borken.
Recovering from that borken state would have to be ok with it if another
Channel Activ attempt results in a NACK message; a subsequent RF Channel Rel
may then still succeed.

Thanks this makes sense, implemented it like that.

Actions #13

Updated by osmith over 1 year ago

  • Checklist item add test set to Done
  • % Done changed from 30 to 60
Actions #14

Updated by keith over 1 year ago

laforge wrote in #note-11:

without knowing all the details/background here, this kind of problem would then only
happen on Ericsson (or possibly other E1) BTS, as in osmo-bts (AFAIR) we drop all state
(channel allocations, etc) when RSL goes down, right?

Let me try to clarify.

1) There's no E1 BTS here, this is osmo-bsc + Abis/IP and osmo-bts.
2) RSL didn't go down. Everything else continues to operate, we just missed an ACK.
3) On this final point I don't have clarity, I'm afraid.

  • - Why did the ACK Timeout? - We do not know.
    • wifi just got hammered at that moment with interference? ubiquiti hit some weird CPU loop + locked up for a few seconds?
    • The ethernet switch in the BTS stopped forwarding packets?
    • Divine Intervention, cosmic rays? your guess is as good as mine.
    • And worst.. I can't reproduce it at will.

How often do you see the RSL LAPD link go down?

I don't. It's not about that, It's about the not so 100% reliable Abis I first mentioned in about 2016 or so - although I guess this kind of recovery strategy would also work on an E1.

Actions #15

Updated by keith over 1 year ago

So, this is the comparison of states on one BTS:

OsmoBSC# show lchan summary 3
BTS 3, TRX 0, Timeslot 2 TCH/H, Lchan 0, Type TCH_H TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -96 dBm RXL-FULL-ul:  -90 dBm
BTS 3, TRX 0, Timeslot 2 TCH/H, Lchan 1, Type TCH_H TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl:  -88 dBm RXL-FULL-ul:  -87 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 5, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 6, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as TCH/F, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
BTS 3, TRX 0, Timeslot 6 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm

and


OsmoBTS# show lchan summary
BTS 0, TRX 0, Timeslot 2 (primary) TCH/H, Lchan 0, Type TCH_H, State ACTIVE - RXL-FULL-ul:  -81 dBm
BTS 0, TRX 0, Timeslot 2 (primary) TCH/H, Lchan 1, Type TCH_H, State ACTIVE - RXL-FULL-ul:  -89 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 5, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 6, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 6 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 6 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 6 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 6 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm
BTS 0, TRX 0, Timeslot 6 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 4, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm

Actions #16

Updated by laforge over 1 year ago

keith wrote in #note-14:

laforge wrote in #note-11:

without knowing all the details/background here, this kind of problem would then only
happen on Ericsson (or possibly other E1) BTS, as in osmo-bts (AFAIR) we drop all state
(channel allocations, etc) when RSL goes down, right?

Let me try to clarify.

1) There's no E1 BTS here, this is osmo-bsc + Abis/IP and osmo-bts.

I'm a bit surprised as the ticket starts with "In #5096 that discussed lchans ending up borken on a busy ericsson bts: [...]"

2) RSL didn't go down. Everything else continues to operate, we just missed an ACK.

You cannot "miss" an ACK. RSL over IP is a TCP stream. It delivers a stream of messages in order. If one message were lost "in flight", none of the following messages would ever arrive.

So there are only the following options
  1. osmo-bts is broken and doesn't ever send the ACK
  2. the ACK is sent but RSL goes down + recovers and we somehow loose state here
  3. there is a temporary stall due to network reliability issues. In this case the ACK does eventually (later) get delivered, but somehow the BSC side FSM is no longer accepting it

all of those cases should be very easy to differentiate by looking at a pcap of the time in question. I would assume you'd have rotating pcap files in place in order to get the pcap file of the time frame it happens? If not, that might be a good idea (see something like tcpdump -C foo -W bar -w /my/file/base tcp port 3002 or tcp port 3003 or udp port 4729).

  • - Why did the ACK Timeout? - We do not know.
  • wifi just got hammered at that moment with interference? ubiquiti hit some weird CPU loop + locked up for a few seconds?
  • The ethernet switch in the BTS stopped forwarding packets?
  • Divine Intervention, cosmic rays? your guess is as good as mine.
  • And worst.. I can't reproduce it at will.

that all still "just" means delay and also means the ACK would eventually get delivered. That's how TCP works.

Please note that we have other production users with hundreds of BTS connected via equally hundreds of sometimes not partiuclarly reliable / flaky satellite back-haul. So flaky that it regularly causes the RSL and OML connections to fail + reconnect.

So far I had assumed this is an Ericsson Abis/E1 related problem, which would explain the obvious difference.

How often do you see the RSL LAPD link go down?

I don't. It's not about that, It's about the not so 100% reliable Abis I first mentioned in about 2016 or so

Either packets "just" get delayed or RSL connection goes down. Those are very easy to distinguish just by looking at whether ot not SYN packets arrive on the RSL TCP port.

Please also note that since one year ago (December 2021) we have a stats_tcp module in libosmocore which you can use to expose various health parameters of the TCP sockets of any osmocom program via stats, and hence include it in your normal performance monitoring/graphing. Parameters like number of unacknowledged/lost/retransmitted packets, the current RTT, the number of bytes waiting to be sent, etc.

I'm currently looking for the documentation of it, but I cannot find any mention of it in osmo-gsm-manuals.git. dexter, can you please help where is the related documentation of this feature? From looking at the source code, there are two vty commands stats-tcp interval and stats-tcp batch-size. The Interval must be non-zero to enable that feature. The batch size determines how many TCP sockets are processed every interval.

As we are talking about an ACK that is arriving potentially late, we'd ideally use this on the BTS side of the link to get useful data. But also the BSC side should give us some indirect idea of the health of the Abis RSL.

Actions #17

Updated by laforge over 1 year ago

keith wrote in #note-9:

Nov 18 14:09:45 nuyoo osmo-bsc1436: DCHAN ERROR lchan_fsm.c:1827 lchan(3-0-6-OSMO_DYN-0)[0x564c3186f920]{WAIT_ACTIV_ACK}: (type=SDCCH) lchan allocation failed in state WAIT_ACTIV_ACK: Timeout

Just to clarify: This is T=-6 (X6) which is by default set to 5 seconds. Did you ever play with that setting? X6 is configurable via VTY, after all.

The question is then (with a high timeout value like 30s): when will other bits further up the stack (assignment FSM) start to timeout, if the channel activation now takes too long. But that might have a behavior that has less overall impact: The current call or whatever wanted to obtain the channel fails, and the lchan doesn't get into 'borken' state.

In general, the tcp_stats on RTT would probably give you a good idea of what's happening with your RSL link, and if RTT >= 5s does happen in practice.

I'm also starting to wonder whether we should disable/bypass that entire "Borken" logic in your setup. The general purpose of it is to prevent the BSC from failing tons of transactions if one lchan is dead (but others are not), and hence it would fail lots of calls while perfectly working lchans are available left and right. But this assumes that

  1. there are failure modes in which only a given lchan is dead (more likely with more "hardware" BTSs where there are dedicated processors or hardware for a timeslot, less likely with "software" BTSs)
  2. the Abis link has a RTT of << Timer X6 (default 5s)
There are two solutions, that I can see:
  • make the "borken" feature vty-configurable (and disable it in your unreliable-abis situation), OR
  • improve the "borken" logic to only enter that state if something like N successive activations of the same lchan fail. So we'd have a per-lchan counter which increments at every ACK timeout, and which resets at every ACK received in time.
    • if we then also have (not sure if we do today) a global BTS/RSL "alive" timer with less than N*X6 (for example N=10, X6=5s = 50s), then we have achieved the ideal behavior in all situations, without any VTY settings:
      • a lchan would still get BORKEN if it repeatedly fails activation
      • a lchan would not get BORKEN if it's a temporary/intermittent delay of Abis RSL transmission
      • if the entire RLS link is not responsive for too long, all the state gets reset, RSL re-connects and no channels are BORKEN from that point on onwards anyway

Does that make sense?

Actions #18

Updated by laforge over 1 year ago

Some more bits when we declare a RSL/OML TCP connection for dead (based on src/input/ipaccess.c in libosmo-abis)

TPC keep-alive

  • assuming line->keepalive_num_probes is non-zero (e1_line <0-255> keepalive with or without parameters in VTY)
    • we set SO_KEEPALIVE to enable the TCP keepalive mechanism
    • we set TCP_KEEPIDLE (number of seconds TCP connection needs to be idle before sending TCP keepalives) to DEFAULT_TCP_KEEPALIVE_IDLE_TIMEOUT (30s)
    • we set TCP_KEEPINTVL (seconds between keepalive probes) to DEFAULT_TCP_KEEPALIVE_INTERVAL (3s)
    • we set TCP_KEEPCNT (number of unanswered keepalive to drop connection) to DEFAULT_TCP_KEEPALIVE_RETRY_COUNT (10)
    • we set TCP_USER_TIMEOUT (maximum amount of time in milliseconds that transmitted data may rmain unacknowledged or the tx window remains stalled) to the following sum/product of the above (30+3*10 = 60s)

So with the default parameters, the RSL or OML link is definitely declared dead after stalling for 60s if the e1_line <0-255> keepalive is used without any other parameters. Is this the case, keith ?

IPA keep-alive

It's actually a surprise to me that we support both TCP-level and IPA level keep-alives. Probably an oversight.

e1_line <0-255> ipa-keepalive <1-300> <1-300> is the command to configure that. It appears disabled by default, and is certainly inferior to the kernel/TCP level keep-alives.

Actions #19

Updated by laforge over 1 year ago

In other words, a X6 timeout of 60s and the default keepalive parameters resulting in 60s should ensure that no channel ever enters BORKEN and that rather the entire RSL link is declared dead, resulting in all state to be reset. I think after 60s it's fair to assume no meaningful operation is possible with that BTS anyway. Of course one could also reduce both to 30s.

The important question is just how the higher-layer FSMs above the lchan_fsm will react to a lchan_activate() taking more than the usual 5s timeout.

  • assignment_fsm: ASSIGNMENT_ST_WAIT_LCHAN_ACTIVE is 10s, meaning it will give up and fail gracefully
  • handover_fsm: uses .T = 23042 but I don't see 23042 ever set to any default value? neels ?
  • immediate assignment: I think those RACH will just never get an answer back and timeout (Good)

so we should be all good in that situation.

Actions #20

Updated by laforge over 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee changed from osmith to keith
Actions #21

Updated by laforge over 1 year ago

  • Related to Feature #5785: enable either TCP or IPA keep-alive by default added
Actions #22

Updated by laforge over 1 year ago

  • Related to Bug #5786: no error message if keepalive options not supported added
Actions #23

Updated by neels over 1 year ago

Analysis: there is an lchan->last_error field that should remember the reason
why the BORKEN state was entered. If there was a timeout it should be visible
in the VTY output via
'show lchan [<0-255>] [<0-255>] [<0-7>] [<0-7>]'
(maybe this should also be shown in 'show lchan summary' via
lchan_dump_short_vty())

Timeout: There still is one problem with the "Delay only" theory, because the
lchan BORKEN state does receive any late Activ ACK and would directly RF
Release the lchan again and go back to the UNUSED state. Meaning that lchans
should not stay in BORKEN for long when the Activ ACK merely arrives late.
Maybe that code path is buggy, AFAIR we don't have a test for that.
Giving a larger X6 timeout should bypass the BORKEN state in the lchan FSM, but
since the BORKEN state detects a late Activ ACK, it shouldn't actually make
much of a difference (unless there's a bug).
https://cgit.osmocom.org/osmo-bsc/tree/src/osmo-bsc/lchan_fsm.c#n1508

On Wed, Nov 23, 2022 at 09:41:35AM +0000, laforge wrote:

  • make the "borken" feature vty-configurable (and disable it in your unreliable-abis situation)

I don't see a way to "disable" the BORKEN state. What to do instead? If
osmo-bsc misses a timely Activ ACK, then we cannot just transition to UNUSED
and send another Activ request later. If the lchan is still in an active state
on the BTS side, we first need to send a release to free the lchan, or all
subsequent Activ requests will be NACKed. If OTOH the BTS side is not in the
active state, the RF Release will fail and we're broken again.
The way I see out of BORKEN is the patch that osmith is currently writing. It
tries both Activ and RF Release without being strict about the outcomes, and
when the release succeeds gets the lchan back in a valid UNUSED state.

  • improve the "borken" logic to only enter that state if something like N
    successive activations of the same lchan fail. So we'd have a per-lchan
    counter which increments at every ACK timeout, and which resets at every ACK
    received in time.

If the BTS received the Activ message, the lchan will be in the active state on
the BTS side, and any amount of activations will fail when there is no
intermittent release.

Retrying should maybe rather happen in the assignment_fsm: if an lchan fails,
try N more lchan activations, picking a different lchan, and the failed lchan
has a bit of time to get out of the broken state. (This is just to still save
the voice call when the activation failed, not really to avoid the BORKEN
state). That might rapidly use up N lchans. To solve that, the BORKEN state
should also rapidly try to recover the lchan (short timeout).

Actions #24

Updated by neels over 1 year ago

  • handover_fsm: uses .T = 23042 but I don't see 23042 ever set to any default value? neels ?

Hm! It is also not listed:
https://osmocom.org/projects/cellular-infrastructure/wiki/List_of_Timer_numbers
Hence the timeout actually used will be 5 seconds in each state, the default of ho_fsm_state_chg()

We could / should publish these timers on the VTY config, i created https://osmocom.org/issues/5787

Actions #25

Updated by keith over 1 year ago

laforge wrote in #note-18:

So with the default parameters, the RSL or OML link is definitely declared dead after stalling for 60s if the e1_line <0-255> keepalive is used without any other parameters. Is this the case, keith ?

E1 config:

e1_input
 e1_line 0 driver ipa
 e1_line 0 port 0
 no e1_line 0 keepalive
 ipa bind 172.16.0.1
Actions #26

Updated by laforge over 1 year ago

On Wed, Nov 23, 2022 at 06:49:10PM +0000, keith wrote:

So with the default parameters, the RSL or OML link is definitely declared dead after stalling for 60s if the e1_line <0-255> keepalive is used without any other parameters. Is this the case, keith ?

E1 config:

> e1_input
>  e1_line 0 driver ipa
>  e1_line 0 port 0
>  no e1_line 0 keepalive
>  ipa bind 172.16.0.1
> 

ok, so there is neither tcp level nor ipa level keepalives active in
your config: Unfortunately the worst of both worlds in the context of this
discussion.

As stated I suggest to change to enabling the TCP keepalives with
default values and at the same time increasing X6 to 60s, and then see
if this avoids your problem.

Actions #27

Updated by keith over 1 year ago

So with e1_line 0 keepalive, [no param], we have this:

20221123133910968 DLINP NOTICE Sign link problems, closing socket. Reason: Connection timed out (ipaccess.c:391)
20221123133910968 DLINP NOTICE E1TS Forcing socket shutdown (ipaccess.c:95)
20221123133910968 DLINP NOTICE E1L Signalling link down (bts_ipaccess_nanobts.c:835)
20221123133910968 DLINP NOTICE (bts=1) Dropping OML link: link down (bts_ipaccess_nanobts.c:611)
20221123133910968 DLMI ERROR (bts=1,trx=0) Lost E1 OML link (osmo_bsc_main.c:450)
20221123133910968 DLMI ERROR (bts=1,trx=0) Lost E1 OSMO link (osmo_bsc_main.c:450)
20221123133910968 DLINP NOTICE (bts=1,trx=0) Dropping RSL link: OML link drop (bts_ipaccess_nanobts.c:588)
20221123133910968 DLMI ERROR (bts=1,trx=0) Lost E1 RSL link (osmo_bsc_main.c:450)
20221123133942325 DLINP NOTICE 172.16.0.1:3002 accept()ed new link from 172.16.0.12:41117 (ipa.c:284)
20221123133942325 DLINP NOTICE Keepalive is set: 0 (ipaccess.c:611)

with of course:
Received a TEI down on a OML link: 2 (0/s 0/m 2/h 1/d)
reflecting it.

What's also strange then it seems to be the BTS itself not responding, rather than the link; I deduce that because the slave in a sbts2050 times out while the master does not, and a BTS (Umsite) that is one WiFi hop further on the same link is not timing out.

Sorry if i've been time wasting. From my observation position, it's hard to evaluate everything correctly, or to stay on top of absolutely everything. I honestly thought that this BORKEN recover idea would be a good thing generally, not just for rhizomatica case. Seems I'm wrong about that.

Actions #28

Updated by keith over 1 year ago

laforge I wonder if you might take a look at the pcap for us. You know a lot more about tcp than I do.

The capture filter in use was "host 172.16.0.11 or host 172.16.0.12 and not udp", those two hosts are the master/slave of a 2050. At the same time running this capture, I'm SYN-pinging the two hosts with mtr -T -i 2 (two seconds interval), you'll see that in the pcap, seems mtr uses destintation port 80.

Skip down to packet no 17336, this is the syn towards port 3002 from osmo-bts restarting after osmo-bsc dropped it due to:

20221123144950328 DLINP NOTICE Sign link problems, closing socket. Reason: Connection timed out (ipaccess.c:391)                                                                   
20221123144950328 DLINP NOTICE E1TS(0:1) Forcing socket shutdown (ipaccess.c:95)                                                                                                   
20221123144950328 DLINP NOTICE E1L(0) Signalling link down (bts_ipaccess_nanobts.c:835)                                                                                            
20221123144950328 DLINP NOTICE (bts=1) Dropping OML link: link down (bts_ipaccess_nanobts.c:611)                                                                                   
20221123144950328 DLMI ERROR (bts=1,trx=0) Lost E1 OML link (osmo_bsc_main.c:450)                                                                                                              

If we go back in the pcap to wall-clock time 14:49:50, the time osmo-bsc detected timeout, and take a look previous to that, sure we can see various retransmissions, and at the same time, you can see the SYN followed by RST,ACK from mtr in packet 16247,16248

We also see an TCP ACK from 172.16.0.12 at 14:49:41 and a CCCH Load indication at 14:49:42, plus a whole bunch of packets between 172.16.0.12 and 172.16.0.1 in the time between 14:49:30 and 14:49:50

There's even a MEAS RES at 14:49:49 and tcp ACK, packets 16406,7

Why would osmo-bsc drop it a barely seconds later?

The e1_input config has
e1_line 0 keepalive [no further params]

In the meantime, I'm going to try increasing the timeout retries which IIUC means I should use:

e1_line 0 keepalive 30 20 3
Actions #29

Updated by keith over 1 year ago

Actions #30

Updated by osmith over 1 year ago

  • Status changed from Feedback to Resolved
  • % Done changed from 60 to 100

Applied in changeset commit:osmo-bsc|0ee9a2f08427e9ef89d239b89efdaf72fe66c6e4.

Actions #31

Updated by osmith over 1 year ago

  • Status changed from Resolved to Feedback
  • % Done changed from 100 to 60

(changing state back, as the related patch was not merged, redmine probably picked it up from my branch...)

Actions #32

Updated by neels over 1 year ago

On Wed, Nov 23, 2022 at 08:03:13PM +0000, keith wrote:

Sorry if i've been time wasting. From my observation position, it's hard to evaluate everything correctly, or to stay on top of absolutely everything. I honestly thought that this BORKEN recover idea would be a good thing generally, not just for rhizomatica case. Seems I'm wrong about that.

I do very much think that it is generally useful, and this "stubborn" behavior
has always bothered me. The lack of recovery from a broken lchan is the way it
is only because when transitioning from osmo-nitb to a separate osmo-bsc back
in the days, i tried to stay as close to the previous behavior as possible
(because the transition was complex enough as it was), and then we never found
the time to properly handle that situation. The issue of sticky broken lchans
does recur every now and then, not only with Rhizomatica, and IMHO it is a
great operational improvement to implement a useful recovery path. I believe it
is by definition Osmocom's goal to be useful when the links aren't exactly
industry top grade reliable, and what helps Rhizomatica also helps other future
community installations, or ones we don't even know about...

Actions #33

Updated by laforge over 1 year ago

On Fri, Nov 25, 2022 at 12:16:28PM +0000, neels wrote:

The issue of sticky broken lchans does recur every now and then, not
only with Rhizomatica, and IMHO it is a great operational improvement
to implement a useful recovery path. I believe it is by definition
Osmocom's goal to be useful when the links aren't exactly industry top
grade reliable, and what helps Rhizomatica also helps other future
community installations, or ones we don't even know about...

My main point is: Before adding a kludge on top of another kludge,
lets revisit whether the existing kludge is needed in its exact current
behavior.

Several people have observed stuck BORKEN state lchans, but IMHO
nobody has ever seen the problem that we (you?) tried to solve: A
broken lchan/TS in hardware which gets allocated again and again,
failing each time, blocking most new calls on the cell.

From there, we have the options:

Either optionally or penanently remove the entire BORKEN approach: Just
fall the LCHAN allocation and let higher layers (assignment fsm, ..) recover from there.

Or change the BORKEN detection logic to only happen if N successive
activations fail, rather than at the first time.

Or change the BORKEN logic to only trigger if actual NACKs are received,
but not in timeout situations.

Actions #34

Updated by neels over 1 year ago

On Fri, Nov 25, 2022 at 01:00:15PM +0000, laforge wrote:

From there, we have the options:

Either optionally or penanently remove the entire BORKEN approach: Just
fall the LCHAN allocation and let higher layers (assignment fsm, ..) recover from there.

Or change the BORKEN detection logic to only happen if N successive
activations fail, rather than at the first time.

Again, AFAICT this makes no sense in any of the failure cases I can imagine. As
I see it, there has to be a separate recovery path when timeout or NACK
happened, to confirm the state of the BTS and nudge it back to a usable state.

So far there is no kludge really, there is simply a full halt. The BORKEN state
is little more than a log message that something unexpected happened. We need
to make that un-sticky, not deny that anything unexpected happened.

I am pretty sure what needs to happen in the various cases, is:

  • an Activ NACK could be for these reasons:
    • the Activation requested something invalid. Then it makes no sense to resend
      the same request N times.
    • the lchan is in a transitory state and not yet ready for activation. Here it
      might make sense to resend the same request a bit later, however, this also
      means that somehow the BSC lost track of what state the BTS is in.
    • the lchan is already active and would become available from a Release only.
  • in case of receiving no response at all:
    • either the BTS received the Activ and its ACK/NACK response got lost somehow,
      so the BTS-lchan is already active now but we're not aware of it.
    • or the BTS didn't receive the Activation request in the first place,
      so the BTS-lchan is not active, but we can't be sure of that.

In all of these cases I see most sense in making sure that BSC and BTS sync up
their notion of what state the lchan is in. The way to achieve that is to flip
the lchan in and out of being active without a subscriber depending on that: Do
a trivial Activation, don't care about its result, and then send a Release.
Only after we get a Release ACK do we know for certain that the lchan is in a
valid UNUSED state. Let the typical wait-after-release timeout pass and only
then consider using the lchan again.

In none of the above cases does it make sense to me to directly regard the
lchan as UNUSED again and try an activation again (no matter if once or five
times).

Regarding higher layer recovery, I believe this aspect is unrelated to the fact
that we absolutely need to have a firm handle on what state the BTS thinks the
lchan is in.

For me osmith's patch seems to be the only useful solution short of dropping
the OML to restart the entire BTS.

Am I missing something in my reasoning?

Hmm, maye we could restart only the single physical channel via OML?

Actions #35

Updated by laforge over 1 year ago

On Fri, Nov 25, 2022 at 02:52:04PM +0000, neels wrote:

So far there is no kludge really, there is simply a full halt. The BORKEN state
is little more than a log message that something unexpected happened. We need
to make that un-sticky, not deny that anything unexpected happened.

In my thinking he BORKEN state ist the first kludge.

I am pretty sure what needs to happen in the various cases, is:
[...]
In all of these cases I see most sense in making sure that BSC and BTS sync up
their notion of what state the lchan is in.

unfortunately neither of those cases cover what is actually happening at rhizomatica: The TCP connection of
RSL simply stalls for some time and the CHAN ACT ACK arrives late. That's why I suggested
differentiating between an actual error and a simple "late ACK" situation.

In none of the above cases does it make sense to me to directly regard the
lchan as UNUSED again and try an activation again (no matter if once or five
times).

Probably not. But it does make in the case we are observing here [TCP stall/delay] :)

Actions #36

Updated by keith over 1 year ago

laforge wrote in #note-35:

The TCP connection of
RSL simply stalls for some time and the CHAN ACT ACK arrives late. That's why I suggested
differentiating between an actual error and a simple "late ACK" situation.

To my (mis-?)understanding, the plot thickens somewhat, watching it with the tcp keepalives implemented (see the e1_lost pcap I attached) - I see that there's nothing wrong with the link, but something (osmo-bts?, kernel tcpstack?) did not respond in a timely manner anyway.

Actions #37

Updated by laforge over 1 year ago

keith wrote in #note-28:

laforge I wonder if you might take a look at the pcap for us.

done. Indeed, something is wrong here. We can see that downlink messages like the PAGING COMMAND are immediately acknowledged afterwards. See for example 16421 with its next-sequence-number at ....6511 and 16423 shhowing the BTS has ACKed that.

So based on this pcap file I don't understand why the bsc-side kernel TCP stack thinks the connection has timed out. Looks like something is broken there :/

Actions #38

Updated by laforge over 1 year ago

regarding keepalives and socket options, see also
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

Actions #39

Updated by keith over 1 year ago

Note that it's really any further information, but I see this just now with a SysmoBTS 1020 connected via ethernet cable, which is just once again confirming that the states are synchronized between BSC+BTS

OsmoBSC# show lchan summary
BTS 0, TRX 0, Timeslot 1 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 1 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 1 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 1 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 4 TCH/F_TCH/H_SDCCH8_PDCH as TCH/H, Lchan 0, Type TCH_H TSC-s1c7, State ESTABLISHED - L1 MS Power: 33 dBm RXL-FULL-dl: -100 dBm RXL-FULL-ul:  -99 dBm       
BTS 0, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm             
BTS 0, TRX 0, Timeslot 5 TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type NONE TSC-s1c0, State BORKEN - L1 MS Power: 0 dBm RXL-FULL-dl: -110 dBm RXL-FULL-ul: -110 dBm
OsmoBSC#
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
OsmoBTS# show lchan summary
BTS 0, TRX 0, Timeslot 1 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm                                           
BTS 0, TRX 0, Timeslot 1 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul:  -75 dBm                                           
BTS 0, TRX 0, Timeslot 1 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul:  -71 dBm                                           
BTS 0, TRX 0, Timeslot 1 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul:  -89 dBm                                           
BTS 0, TRX 0, Timeslot 4 (primary) TCH/F_TCH/H_SDCCH8_PDCH as TCH/H, Lchan 0, Type TCH_H, State ACTIVE - RXL-FULL-ul: -101 dBm                                                     
BTS 0, TRX 0, Timeslot 5 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 0, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm                                           
BTS 0, TRX 0, Timeslot 5 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 1, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm                                           
BTS 0, TRX 0, Timeslot 5 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm                                           
BTS 0, TRX 0, Timeslot 5 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 3, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm                                           
OsmoBTS#

Actions #40

Updated by neels over 1 year ago

On Thu, Dec 01, 2022 at 10:18:59PM +0000, keith wrote:

BTS 0, TRX 0, Timeslot 5 (primary) TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, Lchan 2, Type SDCCH, State BROKEN UNUSABLE - RXL-FULL-ul: -110 dBm

I was wondering how the osmo-bts side could end up with a broken lchan, and
took a look in osmo-bts/src/osmo-bts-sysmo/oml.c (grep for BROKEN). There are
three places in osmo-bts-sysmo putting an lchan in a broken state:

  • when asking the L1 to activate a SAPI and it did not return success
  • same when asking to deactivate a SAPI and it did not return success
  • in the function lchan_deactivate_sapis() when no SAPI deactivation was queued

I haven't looked in detail, but maybe, just maybe this could come from two
activation or deactivation requests in a row? A shot in the dark guess is that
maybe osmo-bts doesn't properly guard against the wrong order of requests
coming from the BSC, attempts to carry it out and gets a NACK from L1 because
of an illogical situation? Continuing on this tangent, if an activation from
the BSC gets lost, and next we send a deactivation from the BSC, then maybe
osmo-bts-sysmo puts it through unfiltered to the L1, gets a NACK and then gives
up by placing the lchan in a broken state? Someone (I?) could investigate this
scenario with a sysmobts...

After implementing dynamic timeslots, i believe osmo-bts-sysmo should be able
to take down and restart an lchan via L1 to snap out of any broken state --
because that's what we do to switch between TCH and PDCH.

All of the above, if at all accurate, still seems to be the symptom / aftermath
of an initial unidentified cause for losing sync between BSC and BTS.

(your report looks like there is a sync that both agree to be broken, but there
is no such communication between BTS and BSC to mark a channel broken. For BSC,
it ends up broken when the BTS responds with an unexpected result; for the BTS,
it ends up broken when the L1 responds with an unexpected result or even
apparently when no SAPIs are active upon calling lchan_deactivate_sapis())

Actions #41

Updated by keith over 1 year ago

Thanks neels

Looking at this happening these past days:

  • It's always TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8
  • I didn't see it before the implementation of TCH/F_TCH/H_SDCCH8_PDCH
  • If it were an L1 issue (on sysmoBTS), that would never have shown up in any tests.

So my feeling is to investigate the L1

Actions #42

Updated by keith over 1 year ago

Just to clarify, there is a vty output posted above where osmo-bsc shows a TCH/F_TCH/H_SDCCH8_PDCH as TCH/F as B0RKEN, but you'll
notice the osmo-bts lchan list doesn't have it. So I meant with always TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8, on the bts side.

Actions #43

Updated by neels over 1 year ago

keith wrote in #note-41:

Thanks neels

Looking at this happening these past days:

  • It's always TCH/F_TCH/H_SDCCH8_PDCH as SDCCH8

pespin implemented this IIRC

  • I didn't see it before the implementation of TCH/F_TCH/H_SDCCH8_PDCH
  • If it were an L1 issue (on sysmoBTS), that would never have shown up in any tests.

So my feeling is to investigate the L1

Actions #44

Updated by keith about 1 year ago

I happened to see this yesterday at my desk, with my SysmoBTS 1002 with 1 phone as load.

Unfortunately, I'm not capturing Abis or logging on my 1002 bts by default, so I don't have anything to add, not that I think there is much to add. - it would seem that the L1/superfemto PHY has an not-understood problem with TCH/F_TCH/H_SDCCH8_PDCH

I'm considering maybe to implement a workaround for the moment something like the deprecated
dyn_ts_allow_tch_f
but rather
dyn_ts_allow_sdcch8

This is now a pretty serious bug, affecting daily operation, but I think it probably escapes all automated testing as the development track that introduced this feature, I'm guessing has been oriented towards and tested on osmo-bts-trx based platforms and that the sysmoBTS been somewhat sidelined :-(

Actions #45

Updated by keith about 1 year ago

So based on this pcap file I don't understand why the bsc-side kernel TCP stack thinks the connection has timed out. Looks like something is broken there :/

I was just doing test calls with a colleague on site. Our call dropped as osmo-bsc logs:

20230206144135662 DLINP NOTICE Sign link problems, closing socket. Reason: Connection timed out (ipaccess.c:390)        
20230206144135662 DLINP NOTICE E1TS(0:1) Forcing socket shutdown (ipaccess.c:95)                                                                                                   
20230206144135662 DLINP NOTICE E1L(0) Signalling link down (bts_ipaccess_nanobts.c:835)                                                                                            
20230206144135662 DLINP NOTICE (bts=2) Dropping OML link: link down (bts_ipaccess_nanobts.c:611)                                                                                   
20230206144135662 DLMI ERROR (bts=2,trx=0) Lost E1 OML link (osmo_bsc_main.c:450)
20230206144135663 DLMI ERROR (bts=2,trx=0) Lost E1 OSMO link (osmo_bsc_main.c:450)                                                                                                 
20230206144135663 DLINP NOTICE (bts=2,trx=0) Dropping RSL link: OML link drop (bts_ipaccess_nanobts.c:588)
20230206144135663 DLMI ERROR (bts=2,trx=0) Lost E1 RSL link (osmo_bsc_main.c:450)

Audio was steady enough right up to the drop.

This is a newly added BTS at this site, which does have a couple of ubiquiti litebeam between it and osmo-bsc. The other BTS is pure ethernet (cable) and I have not seen this erroneous "Dropping OML link: link down"

This probably needs it's own ticket at this point. probably high prio and some way to try and reproduce.
This is in fact show-stopping here on this site.. the BTS is getting dropped regularly.

Actions #46

Updated by pespin about 1 year ago

Hi keith ,

so iiuc there seems to be a link issue and the kernel most probably reports correctly that the TCP connection is gone due to keepalive or alike failing? Maybe you want to tune the TCP keepalive params? But probably you want to first do some sort of analysis of timeouts on the TCP socket ussing "ss" tool or /proc/net/tcp or alike.

I think we can tweak those per socket using VTY, check "e1_line <0-255> keepalive <1-300> <1-20> <1-300>" (libosmo-abis.git set_keepalive_params()).

Actions #47

Updated by keith about 1 year ago

Hi pespin thanks
What do you think about what I wrote in #note-28 ?

Actions #48

Updated by fixeria 4 months ago

  • Related to Feature #4537: OsmoBSC needs strategies to recover broken lchans (lchan state BORKEN) added
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)