Project

General

Profile

Actions

Bug #5245

closed

Race condition during CHANnel ACTivation on dynamic timeslots

Added by fixeria over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
dynamic PDCH
Target version:
-
Start date:
10/01/2021
Due date:
% Done:

100%

Spec Reference:

Description

While investigating a sporadic test case failure [1] in ttcn3-bts-test, I found a serious problem.

[1] https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bts-test/1422/testReport/junit/(root)/BTS_Tests_VAMOS/TC_vamos_chan_act_dyn_osmo_vhh/

Stacktrace

"BTS_Tests.ttcn:755 : Timeout expecting RSL CHAN ACT" 
      BTS_Tests_VAMOS.ttcn:318 BTS_Tests_VAMOS control part
      BTS_Tests_VAMOS.ttcn:225 TC_vamos_chan_act_dyn_osmo_vhh testcase

The test case spawns several concurrent components, each of which is activating a logical channel of type TCH/H on TS4. The TS4 is an Osmocom style dynamic timeslot (TCH/F+TCH/H+SDCCH+PDCH) and can serve up to 4 active subscribers in VAMOS mode - this is what TC_vamos_chan_act_dyn_osmo_vhh is aimed to verify. Therefore on the A-bis/RSL we expect to see 4 CHANnel ACTivation messages and 4 CHANnel RELease messages, as well as the associated ACKs. However, as can be seen in the capture file (see attached), one CHANnel ACTivation ACK is missing. This is why the test case fails.

I've analyzed the PCAP file, and figured out why this happens. More details follow soon.


Files

Actions #1

Updated by fixeria over 2 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 40

I've analyzed the PCAP file, and figured out why this happens. More details follow soon.

First of all, I visually compared the PCAP file of the failed test run with the PCAP file of a successful run. The main (and as it turned out crucial) difference is that all CHANnel ACTIVation messages got multiplexed into a single TCP packet (nr. 8875 in the attached capture) during the failed run. I guess this happened because the Nagle's algorithm is not being disabled in the test suite.

So what happened?

On receipt of the first CHANnel ACTIVation message for TCH/H(0), we see the following logging:

 8891   2.480253  172.18.9.20 → 172.18.9.10  GSMTAP 176 172.18.9.10:3003 connected read 
 8892   2.480259  172.18.9.20 → 172.18.9.10  GSMTAP 178 172.18.9.10:3003 message received 
 8893   2.480275  172.18.9.20 → 172.18.9.10  GSMTAP 224 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH as NONE) ss=0 Rx RSL CHAN_ACTIV 
 8894   2.480283  172.18.9.20 → 172.18.9.10  GSMTAP 231 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) rx chan activ 
 8895   2.480290  172.18.9.20 → 172.18.9.10  GSMTAP 237 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) dyn_ts_l1_reconnect 
 8896   2.480298  172.18.9.20 → 172.18.9.10  GSMTAP 228 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnect 
 8897   2.480306  172.18.9.20 → 172.18.9.10  GSMTAP 230 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnected 
 8898   2.480314  172.18.9.20 → 172.18.9.10  GSMTAP 225 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Connect 
 8899   2.480322  172.18.9.20 → 172.18.9.10  GSMTAP 200 (bts=0,trx=0,ts=4) bts_model_ts_connect(as_pchan=TCH/H) 
 8900   2.480335  172.18.9.20 → 172.18.9.10  GSMTAP 204 (bts=0,trx=0,ts=4) Configured multiframe with 'TCH/H+SACCH' 
 8901   2.480346  172.18.9.20 → 172.18.9.10  GSMTAP 226 TRX_PROV(phy0-0)[0x55b1f9f68d10]{OPEN_POWERON}: Received Event TRX_PROV_EV_CFG_TS 
 8902   2.480359  172.18.9.20 → 172.18.9.10  GSMTAP 200 phy0.0: Enqueuing TRX control command 'CMD SETSLOT 4 3' 
 8903   2.480365  172.18.9.20 → 172.18.9.10  GSMTAP 186 phy0.0: Sending control 'CMD SETSLOT 4 3'

First, we end up in rsl_rx_chan_activ() and hit the following conditional block:

if (ts->pchan == GSM_PCHAN_OSMO_DYN) {
        ts->dyn.pchan_want = dyn_pchan_from_chan_nr(dch->chan_nr);
        DEBUGP(DRSL, "%s rx chan activ\n", gsm_ts_and_pchan_name(ts));

        if (ts->dyn.pchan_is != ts->dyn.pchan_want) { // GSM_PCHAN_NONE != GSM_PCHAN_TCH_H
                /*
                 * The phy has the timeslot connected in a different
                 * mode than this activation needs it to be.
                 * Re-connect, then come back to rsl_rx_chan_activ().
                 */
                rc = dyn_ts_l1_reconnect(ts, msg);
                if (rc)
                        return rsl_tx_chan_act_nack(lchan, RSL_ERR_NORMAL_UNSPEC);
                /* indicate that the msgb should not be freed. */
                return 1;
        }
}

The idea here is to postpone handling of the CHANnel ACTIVation message if the timeslot is a dynamic timeslot that needs to be re-configured with different pchan type (from NONE to TCH/H in our case). The activation message is not free()d, but stored in 'struct gsm_bts_trx_ts' by dyn_ts_l1_reconnect(). And this is exactly where the problem hides!

After these logging lines, we see similar messages:

 8904   2.480394  172.18.9.20 → 172.18.9.10  GSMTAP 176 172.18.9.10:3003 connected read 
 8905   2.480402  172.18.9.20 → 172.18.9.10  GSMTAP 178 172.18.9.10:3003 message received 
 8906   2.480416  172.18.9.20 → 172.18.9.10  GSMTAP 240 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) ss=1 Rx RSL CHAN_ACTIV 
 8907   2.480422  172.18.9.20 → 172.18.9.10  GSMTAP 231 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) rx chan activ 
 8908   2.480431  172.18.9.20 → 172.18.9.10  GSMTAP 237 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) dyn_ts_l1_reconnect 
 8909   2.480437  172.18.9.20 → 172.18.9.10  GSMTAP 228 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnect 
 8910   2.480444  172.18.9.20 → 172.18.9.10  GSMTAP 230 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnected 
 8911   2.480449  172.18.9.20 → 172.18.9.10  GSMTAP 225 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Connect 
 8912   2.480456  172.18.9.20 → 172.18.9.10  GSMTAP 200 (bts=0,trx=0,ts=4) bts_model_ts_connect(as_pchan=TCH/H) 
 8913   2.480466  172.18.9.20 → 172.18.9.10  GSMTAP 204 (bts=0,trx=0,ts=4) Configured multiframe with 'TCH/H+SACCH' 
 8914   2.480476  172.18.9.20 → 172.18.9.10  GSMTAP 226 TRX_PROV(phy0-0)[0x55b1f9f68d10]{OPEN_POWERON}: Received Event TRX_PROV_EV_CFG_TS 
 8915   2.480482  172.18.9.20 → 172.18.9.10  GSMTAP 192 phy0.0: Not sending duplicate command 'SETSLOT'

The second CHANnel ACTIVation message for TCH/H(1) is handled before the PHY responds to the 'SETSLOT' command on TRXC, so the TS4 is still in state 'switching NONE -> TCH/H'. However, rsl_rx_chan_activ() is not smart enough to handle this state correctly, so it calls dyn_ts_l1_reconnect() again. The later does not care about the timeslot's state either, so it leaks the activation message for TCH/H(0) and stores the new one for TCH/H(1).

static int dyn_ts_l1_reconnect(struct gsm_bts_trx_ts *ts, struct msgb *msg)
{
        // ...

        /* We will feed this back to rsl_rx_chan_activ() later */
        ts->dyn.pending_chan_activ = msg; 

        /* Disconnect, continue connecting from cb_ts_disconnected(). */
        DEBUGP(DRSL, "%s Disconnect\n", gsm_ts_and_pchan_name(ts));
        return bts_model_ts_disconnect(ts);
}

This is why we didn't get CHANnel ACTIVation ACK for TCH/H(0). The related msgb got lost when dyn_ts_l1_reconnect() was called for the second time, and thus TCH/H(0) has not been activated at all. In order to trigger this race condition, the BSC needs to be quick enough to send the second CHANnel ACTIVation before the PHY confirms reconfiguration of the timeslot. Very unlikely in a real RAN setup, but still possible.

Actions #2

Updated by fixeria over 2 years ago

  • % Done changed from 40 to 60

I've managed to reproduce this race condition reliably using ttcn3-bts-test:

https://gerrit.osmocom.org/c/osmocom-bb/+/25667 trx_toolkit: support setting artificial delay for TRXC [NEW]
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/25668 f_TRXC_transceive(): accept 'template (value)' parameter [NEW]
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/25669 library/TRXC_Types: add definitions for FAKE_TRXC_DELAY [NEW]
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/25670 BTS: reproduce a race condition described in OS#5245 [NEW]

This is achieved by adding an artificial delay to the TRXC logic of fake_trx.py.

Actions #3

Updated by fixeria over 2 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 60 to 80

Here we go, this patch fixes the channel activation logic:

https://gerrit.osmocom.org/c/osmo-bts/+/25674 rsl: prevent race condition during timeslot re-configuration [NEW]

The key idea is to store CHANnel ACTIVation messages in 'struct gsm_lchan' instead of 'struct gsm_bts_trx_ts', so that several RSL messages can be preserved and handled once the PHY is ready.

While working on it, I submitted some additional changes:

https://gerrit.osmocom.org/c/osmo-bts/+/25671 abis: fix memory leak in abis_oml_sendmsg() [NEW]
https://gerrit.osmocom.org/c/osmo-bts/+/25672 abis: reduce logging verbosity in abis_oml_sendmsg() [NEW]
https://gerrit.osmocom.org/c/osmo-bts/+/25673 rsl: remove redundant logging in rsl_rx_chan_activ() [NEW]

Actions #4

Updated by fixeria over 2 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 80 to 100
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)