BSC sends COMPLETE L3 before RESET
At least in SCCPlite, we've received a protocol trace from a customer that looks like this:
- IPA CCM handshake
- SCCP CR with BSSMAP COMPLETE L3 INFO
- another SCCP CR with BSSMAP COMPLETE L3 INFO
- only then an SCCP UDT with BSSMAP RESET
The RESET procedure should be the first thing to happen after the A link comes up, before any user data is exchanged. The SCCP CR messages in the trace above should ideally be queued (or else discarded) until the RESET procedure completes. Discarding is probably the easier option, since queueing would require timeouts (what if the RESET takes 5 minutes to complete?), ...
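To illustrate the "discard until RESET completes" option, here is a minimal, self-contained sketch. It is not osmo-bsc code: the types and functions (`a_link`, `a_link_may_tx`, `a_link_up`, `a_link_rx_reset_ack`) are hypothetical names made up for this example, and it assumes that only the BSSMAP RESET itself may be sent while the RESET ACK is still pending.

```c
#include <stdbool.h>

/* Hypothetical model of the A link state; not the osmo-bsc data structures. */
enum a_link_state {
	A_LINK_DOWN,           /* signalling link not established */
	A_LINK_WAIT_RESET_ACK, /* link is up, RESET procedure not yet completed */
	A_LINK_READY,          /* RESET procedure completed */
};

enum a_msg_kind {
	A_MSG_BSSMAP_RESET,  /* connectionless, carried in SCCP UDT */
	A_MSG_COMPL_L3_INFO, /* connection-oriented, carried in SCCP CR */
};

struct a_link {
	enum a_link_state state;
	unsigned int discarded; /* messages dropped before RESET completed */
};

/* The signalling link came up: the RESET procedure has to run first. */
void a_link_up(struct a_link *link)
{
	link->state = A_LINK_WAIT_RESET_ACK;
}

/* A RESET ACK arrived: user data may flow from now on. */
void a_link_rx_reset_ack(struct a_link *link)
{
	link->state = A_LINK_READY;
}

/* Return true if the message may be sent now, false if it is discarded. */
bool a_link_may_tx(struct a_link *link, enum a_msg_kind kind)
{
	switch (link->state) {
	case A_LINK_READY:
		return true;
	case A_LINK_WAIT_RESET_ACK:
		/* Only the RESET itself may pass before the procedure completes. */
		if (kind == A_MSG_BSSMAP_RESET)
			return true;
		break;
	case A_LINK_DOWN:
		break;
	}
	link->discarded++;
	return false;
}
```

With such a gate, the two COMPLETE L3 INFO messages from the trace would be counted in `discarded` instead of reaching the MSC ahead of the RESET. A queueing variant would need an additional timer to flush or drop stale entries, which is the extra complexity mentioned above.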
https://gerrit.osmocom.org/c/libosmo-netif/+/15403 stream: Introduce API osmo_stream_cli_is_connected
https://gerrit.osmocom.org/c/libosmo-netif/+/15404 stream: Fix scheduling of queued messages during connecting state
- % Done changed from 0 to 60
https://gerrit.osmocom.org/c/libosmo-sccp/+/15405 ss7: Do not queue messages if stream is not connected
Helpful call stack:
sccp_sclc_user_sap_down_nofree
xua_gen_encode_and_send
xua_gen_msg_cl
sccp_scrc_rx_sclc_msg
sua_addr_parse
scrc_local_out_common
scrc_node_12
gen_mtp_transfer_req_xua
sua2sccp_tx_m3ua
osmo_ss7_user_mtp_xfer_req
m3ua_hmdc_rx_from_l2
hmrt_message_for_routing
ipa_tx_xua_as
xua_as_transmit_msg
osmo_ss7_asp_send
osmo_stream_cli_send/osmo_stream_srv_send
- Category set to A interface
- Status changed from In Progress to Feedback
- % Done changed from 60 to 70
More related commits:
https://gerrit.osmocom.org/c/osmo-bsc/+/15406 a_reset.c: Don't wait 2 seconds to send first BSSMAP RESET
https://gerrit.osmocom.org/c/osmo-bsc/+/15407 bsc: gsm_08_08.c: Remove repeated conn not null check
I could not find the exact culprit of the issue; from what I understand of the code, it should not happen at all. I think it may happen if the BSC<->MSC conn was already established at some earlier point and then got restarted without the BSC knowing about it yet, so the upper layers still think the conn is active and those CL3 Info messages can be sent. Since those are not answered, at some point this condition from a_reset.c triggers, sending the BSSMAP RESET:
if (reset_ctx->conn_loss_counter >= BAD_CONNECTION_THRESOLD)
But I'm just speculating. It's difficult to say, because the BSC logs provided with the pcap file don't match it (e.g. the src port of the connection and the timestamps differ), and the pcap file also lacks the preceding context, so it's almost impossible to know exactly what is going on.
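The suspected mechanism can be sketched as a small, self-contained model. This is not the actual a_reset.c code: only `conn_loss_counter` and `BAD_CONNECTION_THRESOLD` (spelled as in the source) come from the condition quoted above. The struct, the function names, the threshold value of 3 and the counter handling after the threshold are assumptions made for illustration.

```c
#include <stdbool.h>

/* Identifier spelled as quoted from a_reset.c; the value 3 is an assumption. */
#define BAD_CONNECTION_THRESOLD 3

/* Hypothetical stand-in for the reset context. */
struct reset_ctx_model {
	unsigned int conn_loss_counter; /* consecutive unanswered connections */
	bool reset_pending;             /* a BSSMAP RESET should be sent */
};

/* Called when a connection attempt gets no answer from the peer. */
void model_conn_fail(struct reset_ctx_model *ctx)
{
	ctx->conn_loss_counter++;
	if (ctx->conn_loss_counter >= BAD_CONNECTION_THRESOLD) {
		/* Too many failures in a row: assume the peer lost its state. */
		ctx->reset_pending = true;
		ctx->conn_loss_counter = 0;
	}
}

/* Called when a connection is answered: the link is evidently healthy. */
void model_conn_success(struct reset_ctx_model *ctx)
{
	ctx->conn_loss_counter = 0;
}
```

In this model, unanswered COMPLETE L3 INFO attempts on a stale link accumulate until the threshold is reached, and only then is a RESET requested. That would match a trace in which SCCP CR messages appear before the SCCP UDT carrying the RESET, provided the speculation above is right.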
I think the best approach is to stall this ticket and, once the fixes submitted above are merged, try again and collect more data to better figure out the issue.