Bug #4714
closedSMS Message Lost Although Delivery Ack Sent
100%
Description
Hi! I've run into a strange issue in my local testing where the first SMS message of a conversation is occasionally dropped. I acknowledge this could be some issue with my local configuration, but wanted to file a ticket to keep logs and discussion in the same place. For context I'm working on migrating Rhizomatica's deployment from osmo-nitb to the split core network components. Most things are working (voice is reliable and subscriber-create-on-demand is working), but this issue is somewhat severe and blocks stable deployment.
Test setup
----------
I am using all components from the Buster latest repo, but can try and reproduce the issue with the nightly components if you all think that should make a difference.
2 UE/MS -> USRP B210 with GPSDO -> osmo-trx-uhd -> osmo-bts-trx -> osmo 2G CS core network (osmo-bsc osmo-msc osmo-stp osmo-mgw osmo-hlr osmo-sip-connector)
The MSC is configured to use an ESME with the smpp-first delivery policy. For this test the ESME just reflects the messages right back to the MSC since source and dest users are present in the local network. Logging seems to indicate this is working correctly, although the interaction of the ESME and sms delivery state machine could be part of the problem.
Problem
-------
The two test UE/MS's have sims 910540000001012 and 910540000000000. For this test I send several messages from ue0000 to ue1012. ue0000 receives message delivery notifications for all messages, shows delivery successful to the sending user, and does not retry sending. ue1012 receives the second and third message, but does not receive the first message although ue0000 displays it was sent!
Info + Speculation
------------------
I captured a fused debug journal from the msc and bsc, which is attached. I annotated it with lines starting with `# matt9j ->` to help find sections of interest in a very verbose log :) My comments are just my best understanding of what is happening, so they may be inaccurate since I'm not an expert in the 2G network. I've also attached the configs of the msc and the bsc.
It seems like there should not be a radio failure since there are only 2 UEs in a network that is very stable otherwise, but the BTS fails to setup a radio link after paging. This could be an issue with the SDR or the bts, but regardless it seems like the message should not be dropped in this case. I think the message is only lost when paging succeeds but the link then fails to be setup. I am not sure if the intended behavior should be for the message to be retried, or for there to be no delivery ack. Either would be okay, but the inconsistency caused by the message being acked but not delivered is not such a good user experience.
Please let me know if there is other debug information I can provide. I'll try to continue to look into this on my own as well and will post findings here.
Files
Related issues
Updated by keith over 3 years ago
Hi, thanks for posting. I think possibly pcaps of the MSC <-> ESME SMPP and a pcap of the Abis (BSC <-> BTS) could be more useful to see what is happening.
If it's all running on localhost, then a pcap of everything except the BTS<->TRX traffic.
Updated by matt9j over 3 years ago
- File sms-test-5.pcapng sms-test-5.pcapng added
- File sms-test-4.pcapng sms-test-4.pcapng added
- File sms-test-3.pcapng sms-test-3.pcapng added
- File sms-test-2.pcapng sms-test-2.pcapng added
- File sms-test-1-pt2.pcapng sms-test-1-pt2.pcapng added
- File sms-test-1-pt1.pcapng sms-test-1-pt1.pcapng added
keith, I re-tested today to try and get the pcaps requested. I tried to reproduce this for about an hour across 5 different test conditions, but was not able to get a message to drop. I did record behavior where messages were delivered out of order, duplicated, and not paged though. Maybe my experience of losing messages before was actually a combination of out-of-order delivery combined with a delayed message? If I am able to reproduce message dropping again I'll try and get a capture, but it's not reproducing in my environment today.
Test 1, simple exchange, witness messages delayed:Send sms message 1 from UE0000 to UE1012
wait 10s
Send sms message 2 from UE0000 to UE1012
wait 10s
Send sms message 3 from UE0000 to UE1012
wait 30s
- At this point still no messages have arrived at UE1012 yet ***
Send 1 reply from UE1012->UE0000
Message 1 arrives at UE1012
Message 2 arrives at UE1012
Message 3 arrives at UE1012
stop pcap
wait 3 minutes
Realize I should probably keep the pcap running, start second pcap
wait 10 minutes - At this point message 4 still has not arrived at UE0000 yet ***
Send message 5 from UE0000 to UE 1012
Message 4 arrives at UE0000
wait 1 minute - Message 5 has not arrived at UE1012 yet***
End pcap
End test
Put both UE in airplane mode to disconnect from network
wait 2 minutes
Start pcap
Disable airplane mode
UEs attach
Message 5 from test 1 arrives at UE1012
wait 2 minutes
Send message two one UE0000 to UE1012
Send message two two UE0000 to UE1012
Send message two three UE0000 to UE1012
Send message two four UE0000 to UE1012
Send message two five UE0000 to UE1012
wait ~3 minutes
- At this point still no messages have arrived at UE1012 yet ***
Send reply message two six UE1012 to UE0000
Receive message two one at UE1012
Receive message two two at UE1012
Receive Duplicate message two two at UE1012
Receive message two three at UE1012
Receive message two four at UE1012
Receive message two five at UE1012
wait ~1minute
Send reply message two seven from UE1012 to UE0000
at the same time send message two eight from UE0000 to UE1012
Receive message two eight at UE1012
Receive duplicate message two eight at UE1012
wait ~2 minutes - Message two seven still has not arrived at UE0000 ***
Send message two nine from UE0000 to UE1012
Receive message two six at UE0000
Receive message two seven at UE0000
Stop pcap
End test 2
Test 3, try a quick exchange, witness some messages not delayed, others delayed, reordering, and duplication:
Put UE0000 and UE1012 in airplane mode
Remove UE0000 from airplane mode
Send message 3 one from UE0000 to UE1012
Send message 3 two from UE0000 to UE1012
Send message 3 three from UE0000 to UE1012
Remove UE1012 from airplane mode
Receive message 2 nine on UE1012
Send message 3 four from UE1012 to UE0000
Receive message 3 four on UE0000
Receive duplicate message 3 four on UE0000
Receive message 3 one on UE1012
Receive duplicate message 3 one on UE1012
Receive message 3 three on UE1012
Receive message 3 two on UE1012
- send a series of quick replies and responses back and forth***
- No messages lost but usually not delivered until an outgoing message is sent***
Airplane mode and Remove from airplane mode UE0000 and UE1012
Start pcap 4
Receive message two 18 (last of prior exchange) at UE0000
Send 4 one UE0000 to UE1012
send 4 two UE0000 to UE1012
Airplane mode UE1012
Send 4 three from UE0000 to UE1012
Send 4 four from UE0000 to UE1012
Remove UE1012 from airplane mode
Receive message 4 one at UE1012
Wait 1 minute
- At this point messages 4-two 4-three and 4-four have still not arrived ***
wait 1 minute
Send message 4 five to UE1012
Send reply message 4 six to UE0000
Receive duplicate message 4 one at UE1012
Receive message 4 two at UE1012
Receive duplicate message 4 two at UE1012
Receive message 4 three at UE1012
Receive duplicate message 4 three at UE1012
Receive message 4 four at UE1012
Receive message 4 five at UE1012
stop pcap 4
Test 5, try an MSC restart, witness delay:
Start pcap5
Airplane mode both UEs
Restart BTS and MSC
Remove both UEs from airplane mode
Receive Reply 4-six on UE0000
Send message 5 one from UE1012 to UE0000
Send message 5 two from UE1012 to UE0000
Send message 5 three from UE1012 to UE0000
Send message 5 four from UE1012 to UE0000
Send message 5 five from UE1012 to UE0000
Send message 5 six from UE0000 to UE1012
Receive message 5 one at UE0000
Receive message 5 two at UE0000
Receive message 5 three at UE0000
Receive message 5 four at UE0000
Receive message 5 five at UE0000
Send message 5 seven from UE0000 to UE1012
Send message 5 eight from UE1012 to UE0000
Send message 5 nine from UE1012 to UE0000
Receive message 5 six at UE1012
Receive message 5 seven at UE1012
Stop pcap 5
Updated by fixeria over 3 years ago
- Status changed from New to In Progress
- Assignee set to fixeria
- Priority changed from Normal to High
Thanks for your detailed report!
I am currently investigating a very similar MT SMS delivery failure. The problem is that osmo-bsc fails to establish a SAPI=3 link with the MT SMS receiver after getting Paging Response message (on SAPI=0). This can be seen in the attached capture(s), as well as in the logs:
334 29.382874 172.16.0.1 → 172.16.0.99 RSL 89 PAGING CoMmanD 337 29.573677 172.16.0.99 → 172.16.0.1 RSL 96 ESTablish INDication (DTAP) (RR) Paging Response 339 29.574730 187 → 185 BSSAP 152 CR (BSSMAP) Complete Layer 3 Information (DTAP) (RR) Paging Response 340 29.574937 187 → 185 BSSAP 152 CR (BSSMAP) Complete Layer 3 Information (DTAP) (RR) Paging Response 344 29.576075 185 → 187 GSM SMS 156 DT1 (DTAP) (SMS) CP-DATA (RP) RP-DATA (Network to MS) 347 29.576345 185 → 187 GSM SMS 156 DT1 (DTAP) (SMS) CP-DATA (RP) RP-DATA (Network to MS) 349 29.576824 172.16.0.1 → 172.16.0.99 RSL 77 ESTablish REQuest // <-- (!) 360 29.722692 172.16.0.99 → 172.16.0.1 RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 363 30.044439 172.16.0.99 → 172.16.0.1 RSL 77 RELease INDication // <-- (!) 365 30.045087 187 → 185 BSSAP 108 DT1 (BSSMAP) SAPI 'n' Reject 366 30.045323 187 → 185 BSSAP 108 DT1 (BSSMAP) SAPI 'n' Reject 372 30.193222 172.16.0.99 → 172.16.0.1 RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 377 30.361168 172.16.0.99 → 172.16.0.1 RSL 78 CCCH LOAD INDication 379 30.663908 172.16.0.99 → 172.16.0.1 RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 382 31.092130 172.16.0.1 → 172.16.0.99 RSL 75 RF CHANnel RELease 384 31.093300 172.16.0.99 → 172.16.0.1 RSL 75 RF CHANnel RELease ACKnowledge
The reason why we get RELease INDication instead of ESTablish CONFirmation is to be investigated. See packets with numbers 363, 717, 1322 in your https://osmocom.org/attachments/download/4254/sms-test-1-pt1.pcapng.
Updated by fixeria over 3 years ago
- % Done changed from 0 to 40
We have found a problem in LAPDm code causing MT SMS delivery failures, please see #4731.
Updated by fixeria over 3 years ago
- Related to Bug #4731: LAPDm does not implement SAPI priorities between data link layers added
Updated by fixeria over 3 years ago
- Status changed from In Progress to Feedback
I've just submitted a unit test reproducing the problem and potential fix:
https://gerrit.osmocom.org/c/libosmocore/+/19847 lapd_test: add a test checking SAPI0/SAPI3 prioritization
https://gerrit.osmocom.org/c/libosmocore/+/19853 lapdm: fix SAPI-0/SAPI-3 frame prioritization on DCCH
both changes have been merged. This should fix MT SMS delivery, feel free to re-try with the recent master.
Updated by matt9j about 3 years ago
I was able to test today with MSC version 1.6.3 which uses libosmocore 1.4.1 and I believe includes the relevant patches. Delivery was reliable, so I believe this ticket can be closed now, but I now consistently get double messages, which was only happening occasionally before. The double message phenomenon may be on my end related to integration with an ESME or otherwise, so I will investigate more before opening a separate ticket for that issue. Thanks fixeria for the quick attention and core issue fix!
I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).
Updated by fixeria about 3 years ago
- Status changed from Feedback to Resolved
- % Done changed from 40 to 100
I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).
No worries! Thanks for keeping us updated ;)
Updated by laforge about 3 years ago
On Sun, Sep 20, 2020 at 10:09:30PM +0000, matt9j [REDMINE] wrote:
I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).
that should have changed now.