Project

General

Profile

Bug #3782

OML bringup fails for osmo-bts-oc2g on high latency links

Added by dexter 16 days ago. Updated 9 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
02/06/2019
Due date:
% Done:

20%

Spec Reference:

Description

When osmo-bts-oc2g is used in lab setups with low latency links the OML bringup works without any problems. However, when there is a high latency link between BTS and BSC, then the OML bringup fails. The symptoms are not obvious, on the first look the bringup seems to work fine, but the BTS does not start transmitting and closer inspection of the OML related LOG output reveals that there are problems.


Related issues

Related to OsmoBSC - Feature #2470: Proper OML MO (managed object) using osmo_fsmNew2017-08-29

Related to OsmoBTS - Feature #2469: Proper OML MO (managed object) using osmo_fsmNew2017-08-29

Related to OsmoBTS - Bug #3789: BTS permits OPSTART of MO without attributesIn Progress2019-02-07

History

#1 Updated by dexter 16 days ago

Attached one finds a tar file with traces and logs. One with a direct connection where the OML bringup is fine and one with 60ms latency introduced by a latency generator where the OML bringup fails.

In the log there is one distinct symtom to observe in the bts logs. In the no latency case we can see:

Wed Feb  6 08:42:49 2019 <0006> ../../../git/src/osmo-bts-oc2g/oml.c:441 Init TRX (Band 1, ARFCN 123, TSC 7, RxPower -75.000000 dBm, TxPower  0.00 dBm

which looks ok, but when latency is introduced we can see:

Wed Feb  6 08:49:42 2019 <0006> ../../../git/src/osmo-bts-oc2g/oml.c:441 Init TRX (Band 1, ARFCN 0, TSC 0, RxPower -75.000000 dBm, TxPower  0.00 dBm

This is odd, presumably the dependency between certain OML operations is violated and some previous operation did not set required parameters. When comparing the traces one can see that there is a Set BTS Attributes message right at the beginning This message sets at least ARFCN. This message seems to be missing in the latency trace. There the first operations seem to be Set Radio Carrier Attributes. However, wireshark complains [TCP Previous segment not captured] but we can see that there is a Set BTS Attributes done at the end of the trace. (The OsmoBTS Abis Protocol Specification [1] specifies to do the Set BTS Attributes immediately after the software activate.)

Tests with other BTSs do not show the described problem. Tests with osmo-bts-sysmo and osmo-bts-trx went fine.

[1] http://ftp.osmocom.org/docs/latest/osmobts-abis.pdf

#2 Updated by laforge 16 days ago

  • Related to Feature #2470: Proper OML MO (managed object) using osmo_fsm added

#3 Updated by laforge 16 days ago

  • Related to Feature #2469: Proper OML MO (managed object) using osmo_fsm added

#4 Updated by laforge 16 days ago

#5 Updated by laforge 16 days ago

I've used this bug to re-prioritize TTCN-3 support for Abis OML.
I'm working on a codec and related send/receive templates now, which
should allow us (once complated) to test the OML implementation with
fine-grained unit tests both on the BTS and on the BSC side. However,
OML is a rather complex protocol in itself, and I don't expect
super-quick progress, at least not for larger coverage.

Regards,
Harald

#6 Updated by laforge 14 days ago

  • Related to Bug #3789: BTS permits OPSTART of MO without attributes added

#7 Updated by laforge 14 days ago

  • Status changed from New to In Progress

One thing that also strikes me odd is the fact that in the high-latency link the RSL initialization happens already before all the OML MO's are brought up.

It can be seen clearly that SI Type 1 is set in packet 25, at a time where basically none of the MOs (BTS, Radio Carrier, Channels) are OPSTARTed or Unlocked yet.

This is a violation of the Abis OML spec, since as long as the objects are not unlocked, no signalling on either terrestrial nor RF side must happen.

That particular aspect is likely a bug in OsmoBSC which starts to bring up RSL too soon.

#8 Updated by laforge 14 days ago

laforge wrote:

I've used this bug to re-prioritize TTCN-3 support for Abis OML.
I'm working on a codec and related send/receive templates now, which
should allow us (once complated) to test the OML implementation with
fine-grained unit tests both on the BTS and on the BSC side.

I've pushed the initial code to the laforge/oml branch at http://git.osmocom.org/osmo-ttcn3-hacks/log/?h=laforge/oml

#9 Updated by dexter 10 days ago

I had a closer look at osmobts-abis.pdf today and tried to match it against the traces I have from osmo-bts-sysmo and osmo-bts-oc2g. There seem to be a lot of differences between the document and the behavior of the actual implementation. The order in which the MOs are starting up seems to be somewhat confused. I see radio carrier OPSTARTS at the beginning because the radio carrier MO is not locked at startup and other inconsistencies. It seems that do not do software activate. Our software is just active at the beginning so all MOs report their state at once, while the spec specifies that the MOs are activated one by one in order.

I must say osmobts-abis.pdf is very logical written and I understand now the relations between the MOs, however I am not yet sure how to match this to the Behavior I am observing.

Regarding the startup state I think osmo-bts should report all MOs as locked and the radio carrier MO must not be unlocked by default as this triggers an early OPSTAR for the radio carrier way to early. Then it should vist the MOs one by one in order, starting at the site manager MO, continuing with the BTS mo and the going over TRX and radio carrier MO and at the very end over the channel MOs

#10 Updated by laforge 10 days ago

On Mon, Feb 11, 2019 at 08:41:02PM +0000, dexter [REDMINE] wrote:

I had a closer look at osmobts-abis.pdf today and tried to match it against the traces I have from osmo-bts-sysmo and osmo-bts-oc2g. There seem to be a lot of differences between the document and the behavior of the actual implementation.

The document describes the interaction between a nanoBTS and OsmoBSC. So OsmoBTS should be very closely
matching what's described in the document.

However, OsmoBTS came much later and only did a half-ass job at mimicing whatever minimum neccessary
to talk to OsmoBSC. That's where likely the differences come from, at least most of them.

The order in which the MOs are starting up seems to be somewhat confused. I see radio carrier OPSTARTS at the beginning because the radio carrier MO is not locked at startup

there's a separate bug for it, and it's a single-line change to make sure the radio carrier is locked
at start-up. Maybe it's worth to give this a shot?

It seems that do not do software activate. Our software is just active at the beginning so all MOs report their state at once, while the spec specifies that the MOs are activated one by one in order.

yes, but this is defined by the software architecture (one osmo-bts binary which serves all parts
of the BTS) and also clearly stated in the osmobts-abis.pdf manual.

#11 Updated by dexter 9 days ago

I have investigated the problem now further and it is indeed a race condition. From what I can see it should affect all BTS models, but I think it does not because their TRX characteristics are different so that the TRX initalization takes slightly longer then with the oc2g and therefore the race condition has no effect.

From what I can see osmo-bsc has no FSM that controls the OML bringup. All whats happening is that OML messages are received and some longer or shorter cascades of OML responses are triggered. As long as the BTS sends the OML messages in order and with the correct delay everything is fine. However. When we introduce latency the OML messages from the BTS start to pile up in almost a signle TCP packet and then suddenly flood the bsc, which triggers many OML operations at once. And so it happens that the initialization of the radio carrier MO overtakes the BTS MO. And this then leads into the erroneous behavior.

Lets take a closer look at how exactly the radio carrier MO is handled. When osmo-bts starts, the following call-graph is set of. The call to phy_links_open() is made before the program enters the select loop. This means we return at li1f_reset() and then enter the select loop and then the remaining cascade is processed. All this happens very fast and we get the radio carrier sw activate report very quickly so the software activate report piles up in packet 2 along with all the state change reports.

main.c:phy_links_open()
    ==> phy_link.c:phy_links_open()
        ==> l1_if.c:bts_model_phy_link_open()
            ==> l1_if.c:l1if_reset()
                ==> l1_if.c:reset_compl_cb()
                    ==> l1_if.c:l1if_get_info()
                        ==> l1_if.c:info_compl_cb()
                            ==> l1_if.c:l1if_activate_rf()
                                ==> l1_if.c:activate_rf_compl_cb()
                                    ==> oml_mo_tx_sw_act_rep(trx->mo)
                                    ==> oml_mo_tx_sw_act_rep(trx->bb_transc.mo)
                                    <OML BSC=>BTS: RADIO CARRIER SW ACTIVATE REPORT>

osmo-bsc now processes the pile of incoming OML messages and when it reaches the SW activate report for the radio carrier it starts a cascade which unblocks the MO and sends an opstart. This is way to early since the BTS attributes which are necessary for this step are not set yet.

<OML: RADIO CARRIER SW ACTIVATE REPORT>
bts_ipaccess_nanobts.c:bts_ipa_nm_sig_cb()
    ==> bts_ipaccess_nanobts.c:sw_activ_rep()
        ==> bts_ipaccess_nanobts.c:nm_set_radio_attr()
        ==> abis_nm.c:abis_nm_chg_adm_state()
        <OML BSC=>BTS: RADIO CARRIER CHANGE ADMINISTRATIVE STATE UNBLOCKED>
        <OML BTS=>BSC: RADIO CARRIER CHANGE ADMINISTRATIVE STATE UNBLOCKED ACK>        
        ==> abis_nm_opstart()
        <OML BSC=>BTS: RADIO CARRIER OPSTART> <======= TOO EARLY!

I did not fully investigate the cascades required for the Set BTS Attributes, but I think those are more complex and therefore take longer and that this is the reason why we see the Set BTS Attributes so late compared to the Radio Carrier Opstart.

From what I can see is that solving the problem would mean to implement an FSM for each MO and maybe use those as childs in one site-manager FSM. In order to check if my assumption is right I have done a small hack in the phy_links_open() to delay it. After this hack osmo-bts were able to pass the latency simlator. So its possible to fix this quickly, but thats certainly not a clean solution. See also: pmaier/omlhack

I have also tried to set the radio carrier MO to locked at the beginning, this seems not to make any difference.

#12 Updated by dexter 9 days ago

  • % Done changed from 0 to 20

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)