I've analyzed the PCAP file, and figured out why this happens. More details follow soon.
First of all, I visually compared the PCAP file of the failed test run with the PCAP file of a successful run. The main (and as it turned out crucial) difference is that all CHANnel ACTIVation messages got multiplexed into a single TCP packet (nr. 8875 in the attached capture) during the failed run. I guess this happened because the Nagle's algorithm is not being disabled in the test suite.
So what happened?¶
On receipt of the first CHANnel ACTIVation message for TCH/H(0), we see the following logging:
8891 2.480253 172.18.9.20 → 172.18.9.10 GSMTAP 176 172.18.9.10:3003 connected read
8892 2.480259 172.18.9.20 → 172.18.9.10 GSMTAP 178 172.18.9.10:3003 message received
8893 2.480275 172.18.9.20 → 172.18.9.10 GSMTAP 224 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH as NONE) ss=0 Rx RSL CHAN_ACTIV
8894 2.480283 172.18.9.20 → 172.18.9.10 GSMTAP 231 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) rx chan activ
8895 2.480290 172.18.9.20 → 172.18.9.10 GSMTAP 237 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) dyn_ts_l1_reconnect
8896 2.480298 172.18.9.20 → 172.18.9.10 GSMTAP 228 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnect
8897 2.480306 172.18.9.20 → 172.18.9.10 GSMTAP 230 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnected
8898 2.480314 172.18.9.20 → 172.18.9.10 GSMTAP 225 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Connect
8899 2.480322 172.18.9.20 → 172.18.9.10 GSMTAP 200 (bts=0,trx=0,ts=4) bts_model_ts_connect(as_pchan=TCH/H)
8900 2.480335 172.18.9.20 → 172.18.9.10 GSMTAP 204 (bts=0,trx=0,ts=4) Configured multiframe with 'TCH/H+SACCH'
8901 2.480346 172.18.9.20 → 172.18.9.10 GSMTAP 226 TRX_PROV(phy0-0)[0x55b1f9f68d10]{OPEN_POWERON}: Received Event TRX_PROV_EV_CFG_TS
8902 2.480359 172.18.9.20 → 172.18.9.10 GSMTAP 200 phy0.0: Enqueuing TRX control command 'CMD SETSLOT 4 3'
8903 2.480365 172.18.9.20 → 172.18.9.10 GSMTAP 186 phy0.0: Sending control 'CMD SETSLOT 4 3'
First, we end up in rsl_rx_chan_activ() and hit the following conditional block:
if (ts->pchan == GSM_PCHAN_OSMO_DYN) {
ts->dyn.pchan_want = dyn_pchan_from_chan_nr(dch->chan_nr);
DEBUGP(DRSL, "%s rx chan activ\n", gsm_ts_and_pchan_name(ts));
if (ts->dyn.pchan_is != ts->dyn.pchan_want) { // GSM_PCHAN_NONE != GSM_PCHAN_TCH_H
/*
* The phy has the timeslot connected in a different
* mode than this activation needs it to be.
* Re-connect, then come back to rsl_rx_chan_activ().
*/
rc = dyn_ts_l1_reconnect(ts, msg);
if (rc)
return rsl_tx_chan_act_nack(lchan, RSL_ERR_NORMAL_UNSPEC);
/* indicate that the msgb should not be freed. */
return 1;
}
}
The idea here is to postpone handling of the CHANnel ACTIVation message if the timeslot is a dynamic timeslot that needs to be re-configured with different pchan type (from NONE to TCH/H in our case). The activation message is not free()d, but stored in 'struct gsm_bts_trx_ts' by dyn_ts_l1_reconnect(). And this is exactly where the problem hides!
After these logging lines, we see similar messages:
8904 2.480394 172.18.9.20 → 172.18.9.10 GSMTAP 176 172.18.9.10:3003 connected read
8905 2.480402 172.18.9.20 → 172.18.9.10 GSMTAP 178 172.18.9.10:3003 message received
8906 2.480416 172.18.9.20 → 172.18.9.10 GSMTAP 240 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) ss=1 Rx RSL CHAN_ACTIV
8907 2.480422 172.18.9.20 → 172.18.9.10 GSMTAP 231 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) rx chan activ
8908 2.480431 172.18.9.20 → 172.18.9.10 GSMTAP 237 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) dyn_ts_l1_reconnect
8909 2.480437 172.18.9.20 → 172.18.9.10 GSMTAP 228 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnect
8910 2.480444 172.18.9.20 → 172.18.9.10 GSMTAP 230 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Disconnected
8911 2.480449 172.18.9.20 → 172.18.9.10 GSMTAP 225 (bts=0,trx=0,ts=4,pchan=TCH/F_TCH/H_SDCCH8_PDCH switching NONE -> TCH/H) Connect
8912 2.480456 172.18.9.20 → 172.18.9.10 GSMTAP 200 (bts=0,trx=0,ts=4) bts_model_ts_connect(as_pchan=TCH/H)
8913 2.480466 172.18.9.20 → 172.18.9.10 GSMTAP 204 (bts=0,trx=0,ts=4) Configured multiframe with 'TCH/H+SACCH'
8914 2.480476 172.18.9.20 → 172.18.9.10 GSMTAP 226 TRX_PROV(phy0-0)[0x55b1f9f68d10]{OPEN_POWERON}: Received Event TRX_PROV_EV_CFG_TS
8915 2.480482 172.18.9.20 → 172.18.9.10 GSMTAP 192 phy0.0: Not sending duplicate command 'SETSLOT'
The second CHANnel ACTIVation message for TCH/H(1) is handled before the PHY responds to the 'SETSLOT' command on TRXC, so the TS4 is still in state 'switching NONE -> TCH/H'. However, rsl_rx_chan_activ() is not smart enough to handle this state correctly, so it calls dyn_ts_l1_reconnect() again. The later does not care about the timeslot's state either, so it leaks the activation message for TCH/H(0) and stores the new one for TCH/H(1).
static int dyn_ts_l1_reconnect(struct gsm_bts_trx_ts *ts, struct msgb *msg)
{
// ...
/* We will feed this back to rsl_rx_chan_activ() later */
ts->dyn.pending_chan_activ = msg;
/* Disconnect, continue connecting from cb_ts_disconnected(). */
DEBUGP(DRSL, "%s Disconnect\n", gsm_ts_and_pchan_name(ts));
return bts_model_ts_disconnect(ts);
}
This is why we didn't get CHANnel ACTIVation ACK for TCH/H(0). The related msgb got lost when dyn_ts_l1_reconnect() was called for the second time, and thus TCH/H(0) has not been activated at all. In order to trigger this race condition, the BSC needs to be quick enough to send the second CHANnel ACTIVation before the PHY confirms reconfiguration of the timeslot. Very unlikely in a real RAN setup, but still possible.