Project

General

Profile

Bug #2380

logical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH / T3109 can be disabled

Added by laforge 6 months ago. Updated 5 months ago.

Status:
Closed
Priority:
High
Assignee:
Category:
libbsc
Target version:
-
Start date:
07/19/2017
Due date:
% Done:

100%

Resolution:
Spec Reference:

Description

Every so often we received bug reports from users indicating "stuck" logical channels in the BSC, which accumulat over time and don't get automatically recycled. Unfortunately I cannot seem to find any issue here in redmine about it, but I just have experienced something similar here during testing with osmo-bts-virtual, virtphy and my L1CTL/LAPDm test cases from osmo-ttcn3-hacks.

From the BSC side, formerly-used channels would still show up in "show lchan", even on a completely idle cell:

OpenBSC> show lchan
BTS 0, TRX 0, Timeslot 0, Lchan 1: Type SDCCH
  Connection: 0, State: RELEASE REQUESTED
  BS Power: 3 dBm, MS Power: 36 dBm
  Channel Mode / Codec: signalling
  No Subscriber
  Bound IP: 0.0.0.0 Port 0 RTP_TYPE2=0 CONN_ID=0
  Measurement Report:
    Flags: DLinval
    RXL-FULL-ul: -110 dBm, RXL-SUB-ul: -110 dBm RXQ-FULL-ul: 0, RXQ-SUB-ul: 0

I can not trigger this with a well-behaving MS (e.g. OsmocomBB performing a LU). In that case, the lchan is released as expected (rsl_lu_normal.pcapng)

However, from a L1CTL based test case that doesn't properly establish LAPDm on the radio channel, it can be reproduced quite easily ( (rsl_est_fefe_fail.pcapng).

Initial comparison of the pcap traces shows that the erroneous case (causing stuck lchan) doesn't ever show a RELEASE INDICATION from BTS to BSC. This is not a surprise, as there is no functional LAPDm entity in the test case, and hence there is no established SACCH that could be released.

rsl_lu_normal.pcapng - RSL trace with normal LU behaving properly (2.23 KB) laforge, 07/19/2017 02:32 PM

rsl_est_fefe_fail.pcapng - RSL trace without proper LAPDm instance causing stuck channel (1.38 KB) laforge, 07/19/2017 02:32 PM


Related issues

Related to OpenBSC - Bug #49: T3109 not implemented Closed
Related to OpenBSC - Bug #50: Abnormal channel release handling is wrong Closed
Related to OsmoBSC - Bug #1610: RF channel release skipping T3109 New 02/23/2016

History

#1 Updated by laforge 6 months ago

  • Related to Bug #49: T3109 not implemented added

#2 Updated by laforge 6 months ago

  • Related to Bug #50: Abnormal channel release handling is wrong added

#3 Updated by laforge 6 months ago

  • Related to Bug #1610: RF channel release skipping T3109 added

#4 Updated by laforge 6 months ago

  • Status changed from New to In Progress

While the test case without any real LAPDm instance on the MS side is of course fairly artificial, this can of course also occur in normal operation, in at least the following cases:

  • MS runs out of coverage, out of power or reboots before being able to send LAPDm DISC (genreating RELEASE INDICATION at BTS)
  • radio interference at the time of the LAPDm DISC
  • malicious or buggy MS

This might explain some very sporadic failure cases in real-world setups

#5 Updated by neels 6 months ago

  • Description updated (diff)

#6 Updated by laforge 6 months ago

  • % Done changed from 0 to 50

The problem was that in the specific config file I was using T3109 was set to 0. And without T3109 being ever started, the erroneous channel release procedure doesn't complete. I still think this is a bug. we either should reject a value of zero, or otherwise remain in a clean / sane state, no matter what the user configures.

#7 Updated by laforge 6 months ago

  • Subject changed from logical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH to logical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH / T3109 can be disabled
  • % Done changed from 50 to 80

#8 Updated by laforge 6 months ago

Please note that even the osmo-gsm-tester config file templates suffer from this, see https://gerrit.osmocom.org/3308

#9 Updated by laforge 6 months ago

all three related gerrit changes merged.

#10 Updated by ipse 6 months ago

My personal thank you for fixing this. It's been a very annoying issue in our installations.

#11 Updated by laforge 6 months ago

On Thu, Jul 20, 2017 at 10:27:16AM +0000, ipse [REDMINE] wrote:

My personal thank you for fixing this. It's been a very annoying issue in our installations.

Did you have T3109 set to 0 in your config? If not, it might be another
bug, so don't thank me too soon.

It's of course always appreciated if you could help us expanding the
test coverage :)

#12 Updated by zecke 6 months ago

On 20. Jul 2017, at 01:22, laforge [REDMINE] <> wrote:

Issue #2380 has been updated by laforge.

% Done changed from 0 to 50

The problem was that in the specific config file I was using T3109 was set to 0. And without T3109 being ever started, the erroneous channel release procedure doesn't complete. I still think this is a bug. we either should reject a value of zero, or otherwise remain in a clean / sane state, no matter what the user configures.

Keeping this 0 (== not do anything) behavior was a bad compatibility decision... IIRC we (I) made timers configurable before we implemented them and I didn't want to reject t3109 0. :(

But there is a (delayed) SACCH release indication issue with the lchan code. I posted twice to the ML but won't have time to fix it right now (SACCH release indication arriving after the timeout will do bad things).

#13 Updated by laforge 6 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100

#14 Updated by ipse 6 months ago

laforge wrote:

On Thu, Jul 20, 2017 at 10:27:16AM +0000, ipse [REDMINE] wrote:

My personal thank you for fixing this. It's been a very annoying issue in our installations.

Did you have T3109 set to 0 in your config? If not, it might be another
bug, so don't thank me too soon.

Yes, they were set to 0, because we copy-pasted that part of the config from examples. I've manually changed them to the defaults from your patch - we'll see if we see stuck channels again in the next few days.

#15 Updated by ipse 6 months ago

We've monitored our installation were we changed the timer values to the reasonable defaults and amount of stuck channels seem to be better but is not 100% gone. After a surge in traffic we've got one BTS which has 2 stuck channels. BTS is still running with those channels stuck - let me know if you want me to provide some more debugging info about it.

PS I didn't restart osmo-nitb or osmo-bts's after setting timer values, as I assume they are applied immediately to new channels. Is that correct?

#16 Updated by ipse 5 months ago

We've just got another stuck channel in the following state, so it looks like the issue is not completely resolved:

OpenBSC> show lchan 6 1 0 
BTS 6, TRX 1, Timeslot 0, Lchan 0: Type NONE
  Connection: 0, State: BROKEN UNUSABLE Error reason: activation timeout
  BS Power: 35 dBm, MS Power: 33 dBm
  Channel Mode / Codec: signalling
  No Subscriber
  Bound IP: 0.0.0.0 Port 0 RTP_TYPE2=0 CONN_ID=0
  Measurement Report:
    Flags: DLinval 
    RXL-FULL-ul: -110 dBm, RXL-SUB-ul: -110 dBm RXQ-FULL-ul: 0, RXQ-SUB-ul: 0

Interestingly, so far I've only seen TCH channels stuck and I've never seen SDCCH channels stuck.

I tried to analyze whether something else was happening at the time of the issue and it looks like another BTS had a spike in traffic making both of its 2 TRX saturated AND we had relatively high load average on the server at the same time (1.2 vs normal 0.2-0.5) which makes me think it's IO related. I think what may be happening is that a (relatively) high time waiting for our SQLite DB may have led to a timeout and stuck the channel. Not sure how realistic this is.

Happy to run some more debugging or provide NITB/BTS logs if there is any interest in fixing this.

#17 Updated by ipse 5 months ago

PS I also see an OML/RSL fail at around the same time, so I wonder if this is related. I failed to find this moment in the logs after a quick look, though, so can't correlate. =|

#18 Updated by zecke 5 months ago

@ipse/all: Please see: http://lists.osmocom.org/pipermail/openbsc/2017-April/010560.html.

One issue is if the SACCH deactivation ack arrives after the timeout (e.g. because it is delayed through TCP) we will try to release the channel again and as it fails mark the channel as broken. It should only be a couple of hours of work (it can be easily simulated with my Smalltalk fakebts code or newer approaches, ttcn-3?) but sadly the combination of day job + commute keep me busy enough.

Alex, I hope that fairwaves finds the time to move the lchan to use a FSM or can sponsor that work? I am afraid that my non OpenBSC/Osmocom paid work and commute don't leave too much time to fix such issues.

#19 Updated by keith 5 months ago

Rhizomatica installations have had

    timer t3109 4

for as long as I have been monitoring this issue.

After patches from zecke a few months back, we now recover all stuck or "broken" channels except for those resulting from NACK on activation.
I currently use a cronjob to wait for a quiet moment and then drop the bts when these accumulate.
I found it hard to impossible to reproduce anything in the production environment, or to track down any identifiable cause, so I kind of gave up on it for the moment.

Also available in: Atom PDF