Feature #4537

OsmoBSC needs strategies to recover broken lchans (lchan state BORKEN)

Added by neels 27 days ago. Updated 15 days ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 05/08/2020
Due date:
% Done: 30%
Spec Reference:

Description

Currently, there are some situations where broken lchans stick around in osmo-bsc without being recovered.
The reason is that we cannot be sure what state the BTS is in, which may be distinct for each individual BTS model.

There are various reasons for an lchan to reach a broken state:

{act, deact} x {timeout, got a nack} x {for CS, for PS}

Each of those potentially has distinct ways in which the BSC should/could try to recover.
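To make the size of that matrix concrete, here is a minimal sketch that enumerates the combinations. The tuple labels are illustrative only and are not identifiers from osmo-bsc.

```python
from itertools import product

# Labels taken from the description above; they are illustrative,
# not osmo-bsc identifiers.
OPERATIONS = ("act", "deact")
FAILURES = ("timeout", "nack")
DOMAINS = ("CS", "PS")


def broken_causes():
    """Return every (operation, failure, domain) combination."""
    return list(product(OPERATIONS, FAILURES, DOMAINS))


causes = broken_causes()
for cause in causes:
    print(" / ".join(cause))
```

That yields eight distinct cases, each of which may need its own recovery strategy.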

For example:

  • after a while, try to CHAN ACTIV the lchan (as a probe, not for a subscriber request).
  • if the BTS accepts the activation, then deactivate again, and the lchan becomes usable again.
  • if the activation didn't work, then try to deactivate. If that succeeds, the lchan becomes usable again.
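The probing sequence above can be sketched roughly as follows. This is not osmo-bsc code: `send_chan_activ` and `send_chan_deact` are hypothetical callables that return True on an ACK and False on a NACK or timeout. The case where the probe activation succeeds but the following deactivation fails is not covered in the description; the sketch assumes the lchan stays broken then.

```python
from enum import Enum


class LchanState(Enum):
    UNUSED = "UNUSED"
    BORKEN = "BORKEN"


def probe_recover(send_chan_activ, send_chan_deact):
    """Run one probe attempt on a broken lchan and return its new state.

    Both arguments are hypothetical stand-ins for the RSL exchange:
    they return True on ACK, False on NACK or timeout.
    """
    if send_chan_activ():
        # The BTS accepted the probe activation: release it again.
        if send_chan_deact():
            return LchanState.UNUSED
        # Assumption: a failed release leaves the lchan broken.
        return LchanState.BORKEN
    # Activation failed: try a plain deactivation instead.
    if send_chan_deact():
        return LchanState.UNUSED
    return LchanState.BORKEN
```

For example, `probe_recover(lambda: False, lambda: True)` models a BTS that rejects the activation but acknowledges the deactivation, so the lchan returns to UNUSED.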

There are also other, more general approaches to try to recover:

  • if a BTS has one or more broken lchans, and there comes a moment where all lchans are unused, drop the OML link to cause a restart of the BTS. After that, all lchans are reset to a clean state.
  • ...?
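The condition for the OML-drop approach could be checked along these lines. This is a sketch under assumptions: lchan states are plain strings with simplified names, and every state other than UNUSED or BORKEN is treated as "in use".

```python
def should_drop_oml(lchan_states):
    """Decide whether to restart the BTS by dropping its OML link.

    lchan_states: iterable of simplified state names, e.g.
    "UNUSED", "BORKEN", "ESTABLISHED" (assumed naming).
    """
    states = list(lchan_states)
    has_broken = any(s == "BORKEN" for s in states)
    # Assumption: anything that is neither UNUSED nor BORKEN is
    # currently serving (or being set up for) a subscriber.
    none_in_use = all(s in ("UNUSED", "BORKEN") for s in states)
    return has_broken and none_in_use
```

With `["UNUSED", "BORKEN", "ESTABLISHED"]` the function returns False, because one lchan is still in use; once that lchan is released, it returns True.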

An important aspect: currently, the BSC picks a free lchan, and if that fails, the subscriber gets kicked out. We don't retry with another lchan. So, we cannot risk marking a channel as UNUSED when it is in fact broken: since we always pick the first lchan, if it is broken under the hood, every subscriber gets kicked out, and the entire BTS could become unusable as soon as the first broken lchan shows up.

We could introduce a way for osmo-bsc to retry establishing a different lchan for a subscriber if the first one fails.
That would make it less dangerous to have a broken state in the BTS while letting the BSC try that lchan anyway.
However, this should still remain distinguishable from a clean UNUSED lchan that never had a problem, so that constantly unrecoverable lchans are visible in the BSC's "show lchan summary".
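A rough sketch of such a retry, assuming lchans are modelled as dicts and `try_activate` is a hypothetical callable that returns True when the BTS acknowledges the activation. A failed lchan is marked BORKEN rather than returned to UNUSED, so it stays visible and is skipped by later requests.

```python
def assign_lchan(lchans, try_activate):
    """Pick the first UNUSED lchan that actually activates.

    lchans: list of dicts with "name" and "state" keys (assumed model).
    try_activate: hypothetical callable, lchan -> bool (True on ACK).
    Returns the assigned lchan, or None if every candidate failed.
    """
    for lchan in lchans:
        if lchan["state"] != "UNUSED":
            continue
        if try_activate(lchan):
            lchan["state"] = "ESTABLISHED"
            return lchan
        # Keep the failure visible instead of going back to UNUSED.
        lchan["state"] = "BORKEN"
    return None


lchans = [
    {"name": "lchan0", "state": "UNUSED"},  # broken under the hood
    {"name": "lchan1", "state": "UNUSED"},
]
chosen = assign_lchan(lchans, lambda l: l["name"] != "lchan0")
```

Here the subscriber ends up on `lchan1` instead of being dropped, and `lchan0` is left in the BORKEN state.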

A compound solution of the above could be the gold standard:

  • if establishing an lchan fails, try another one instead of dropping the subscriber.
    That could be nicer for subscribers when first hitting a broken lchan.
    (counter argument: does the MS anyway retry establishing an lchan again?
    Maybe for first access, but finding an lchan for voice call assignment could retry different lchans.)
  • an lchan that failed should remain in a broken state for a minimum short time (T3111?).
  • when in a broken state, the BSC should regularly try to send CHAN ACTIV and/or CHAN DEACT messages to probe whether the BTS responds with a sane state (probably depending on how the broken state was reached).
  • Since recovery could still fail, the BSC should notice when a BTS has lchans that remain broken for a given period of time.
    It should then try to reset the BTS completely (drop OML) when it reaches a moment of no lchans being in use.
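How the timing parts of this compound solution could interact is sketched below. All names and default values are hypothetical placeholders; in particular, the minimum hold time merely stands in for whatever timer (T3111 or otherwise) ends up being used.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPolicy:
    # All values are hypothetical placeholders, in seconds.
    min_broken_s: float = 2.0       # minimum time to stay broken
    probe_interval_s: float = 30.0  # spacing between probe attempts
    reset_after_s: float = 600.0    # broken this long: consider a BTS reset


def next_action(policy, broken_for_s, since_last_probe_s, lchans_in_use):
    """Return "wait", "probe" or "drop_oml" for one broken lchan."""
    if broken_for_s < policy.min_broken_s:
        return "wait"
    if broken_for_s >= policy.reset_after_s and lchans_in_use == 0:
        # Only reset the BTS when no subscriber would be affected.
        return "drop_oml"
    if since_last_probe_s >= policy.probe_interval_s:
        return "probe"
    return "wait"
```

Keeping one policy object per BTS would be one way to allow different behaviour for different BTS models.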

These approaches need to be tested on all supported BTS models,
and ideally should be configurable per-bts in the osmo-bsc.cfg.


Related issues

Related to OsmoBSC - Bug #4166: BORKEN Error / broken lchan / WAIT_ACTIV_ACK: Timeout (Resolved, 08/22/2019)

Related to OsmoBSC - Feature #4540: ensure there are counters indicating lchans entering and leaving the broken state (BORKEN) (Resolved, 05/09/2020)

Related to OsmoBSC - Feature #4541: add ttcn3 tests for situations of Abis latency spikes, where chan act/deact ACK messages arrive late (New, 05/09/2020)

History

#1 Updated by neels 27 days ago

related: I turned down https://gerrit.osmocom.org/c/osmo-bsc/+/11958/ because going back to UNUSED straight away could end up rejecting all subscribers (if the lchan is in fact broken but marked as usable).

#2 Updated by neels 27 days ago

  • Related to Bug #4166: BORKEN Error / broken lchan / WAIT_ACTIV_ACK: Timeout added

#3 Updated by neels 27 days ago

turns out that some recovery from broken lchans due to late ACK messages was not working.
Fixes merged: https://gerrit.osmocom.org/c/osmo-bsc/+/18170 https://gerrit.osmocom.org/c/osmo-bsc/+/18171

#4 Updated by neels 27 days ago

  • Related to Feature #4540: ensure there are counters indicating lchans entering and leaving the broken state (BORKEN) added

#5 Updated by neels 27 days ago

  • Related to Feature #4541: add ttcn3 tests for situations of Abis latency spikes, where chan act/deact ACK messages arrive late added

#6 Updated by neels 15 days ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30
