Project

General

Profile

Actions

Bug #4714

closed

SMS Message Lost Although Delivery Ack Sent

Added by matt9j over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
SMS
Target version:
-
Start date:
08/16/2020
Due date:
% Done:

100%

Resolution:
Spec Reference:

Description

Hi! I've run into a strange issue in my local testing where the first SMS message of a conversation is occasionally dropped. I acknowledge this could be some issue with my local configuration, but wanted to file a ticket to keep logs and discussion in the same place. For context I'm working on migrating Rhizomatica's deployment from osmo-nitb to the split core network components. Most things are working (voice is reliable and subscriber-create-on-demand is working), but this issue is somewhat severe and blocks stable deployment.

Test setup
----------

I am using all components from the Buster latest repo, but can try and reproduce the issue with the nightly components if you all think that should make a difference.

2 UE/MS -> USRP B210 with GPSDO -> osmo-trx-uhd -> osmo-bts-trx -> osmo 2G CS core network (osmo-bsc osmo-msc osmo-stp osmo-mgw osmo-hlr osmo-sip-connector)

The MSC is configured to use an ESME with the smpp-first delivery policy. For this test the ESME just reflects the messages right back to the MSC since source and dest users are present in the local network. Logging seems to indicate this is working correctly, although the interaction of the ESME and sms delivery state machine could be part of the problem.

Problem
-------

The two test UE/MS's have sims 910540000001012 and 910540000000000. For this test I send several messages from ue0000 to ue1012. ue0000 receives message delivery notifications for all messages, shows delivery successful to the sending user, and does not retry sending. ue1012 receives the second and third message, but does not receive the first message although ue0000 displays it was sent!

Info + Speculation
------------------

I captured a fused debug journal from the msc and bsc, which is attached. I annotated it with lines starting with `# matt9j ->` to help find sections of interest in a very verbose log :) My comments are just my best understanding of what is happening, so they may be inaccurate since I'm not an expert in the 2G network. I've also attached the configs of the msc and the bsc.

It seems like there should not be a radio failure since there are only 2 UEs in a network that is very stable otherwise, but the BTS fails to setup a radio link after paging. This could be an issue with the SDR or the bts, but regardless it seems like the message should not be dropped in this case. I think the message is only lost when paging succeeds but the link then fails to be setup. I am not sure if the intended behavior should be for the message to be retried, or for there to be no delivery ack. Either would be okay, but the inconsistency caused by the message being acked but not delivered is not such a good user experience.

Please let me know if there is other debug information I can provide. I'll try to continue to look into this on my own as well and will post findings here.


Files

lost-first-message-zero-to-twelve.txt lost-first-message-zero-to-twelve.txt 310 KB fused journal log of bsc and msc when message is lost matt9j, 08/16/2020 07:06 PM
osmo-msc.cfg osmo-msc.cfg 1.28 KB matt9j, 08/16/2020 07:07 PM
osmo-bsc.cfg osmo-bsc.cfg 3.85 KB matt9j, 08/16/2020 07:07 PM
sms-test-4.pcapng sms-test-4.pcapng 735 KB matt9j, 08/17/2020 11:52 PM
sms-test-5.pcapng sms-test-5.pcapng 1.03 MB matt9j, 08/17/2020 11:52 PM
sms-test-3.pcapng sms-test-3.pcapng 1.1 MB matt9j, 08/17/2020 11:52 PM
sms-test-1-pt2.pcapng sms-test-1-pt2.pcapng 817 KB matt9j, 08/17/2020 11:52 PM
sms-test-1-pt1.pcapng sms-test-1-pt1.pcapng 293 KB matt9j, 08/17/2020 11:52 PM
sms-test-2.pcapng sms-test-2.pcapng 1.03 MB matt9j, 08/17/2020 11:52 PM

Related issues

Related to libosmocore - Bug #4731: LAPDm does not implement SAPI priorities between data link layersStalledfixeria08/26/2020

Actions
Actions #1

Updated by keith over 3 years ago

Hi, thanks for posting. I think possibly pcaps of the MSC <-> ESME SMPP and a pcap of the Abis (BSC <-> BTS) could be more useful to see what is happening.

If it's all running on localhost, then a pcap of everything except the BTS<->TRX traffic.

Actions #2

Updated by matt9j over 3 years ago

keith, I re-tested today to try and get the pcaps requested. I tried to reproduce this for about an hour across 5 different test conditions, but was not able to get a message to drop. I did record behavior where messages were delivered out of order, duplicated, and not paged though. Maybe my experience of losing messages before was actually a combination of out-of-order delivery combined with a delayed message? If I am able to reproduce message dropping again I'll try and get a capture, but it's not reproducing in my environment today.

Test 1, simple exchange, witness messages delayed:
Send sms message 1 from UE0000 to UE1012
wait 10s
Send sms message 2 from UE0000 to UE1012
wait 10s
Send sms message 3 from UE0000 to UE1012
wait 30s
  • At this point still no messages have arrived at UE1012 yet ***
    Send 1 reply from UE1012->UE0000
    Message 1 arrives at UE1012
    Message 2 arrives at UE1012
    Message 3 arrives at UE1012
    stop pcap
    wait 3 minutes
    Realize I should probably keep the pcap running, start second pcap
    wait 10 minutes
  • At this point message 4 still has not arrived at UE0000 yet ***
    Send message 5 from UE0000 to UE 1012
    Message 4 arrives at UE0000
    wait 1 minute
  • Message 5 has not arrived at UE1012 yet***
    End pcap
    End test
Test 2, try multiple messages, witness duplication:
Put both UE in airplane mode to disconnect from network
wait 2 minutes
Start pcap
Disable airplane mode
UEs attach
Message 5 from test 1 arrives at UE1012
wait 2 minutes
Send message two one UE0000 to UE1012
Send message two two UE0000 to UE1012
Send message two three UE0000 to UE1012
Send message two four UE0000 to UE1012
Send message two five UE0000 to UE1012
wait ~3 minutes
  • At this point still no messages have arrived at UE1012 yet ***
    Send reply message two six UE1012 to UE0000
    Receive message two one at UE1012
    Receive message two two at UE1012
    Receive Duplicate message two two at UE1012
    Receive message two three at UE1012
    Receive message two four at UE1012
    Receive message two five at UE1012
    wait ~1minute
    Send reply message two seven from UE1012 to UE0000
    at the same time send message two eight from UE0000 to UE1012
    Receive message two eight at UE1012
    Receive duplicate message two eight at UE1012
    wait ~2 minutes
  • Message two seven still has not arrived at UE0000 ***
    Send message two nine from UE0000 to UE1012
    Receive message two six at UE0000
    Receive message two seven at UE0000
    Stop pcap
    End test 2

Test 3, try a quick exchange, witness some messages not delayed, others delayed, reordering, and duplication:
Put UE0000 and UE1012 in airplane mode
Remove UE0000 from airplane mode
Send message 3 one from UE0000 to UE1012
Send message 3 two from UE0000 to UE1012
Send message 3 three from UE0000 to UE1012
Remove UE1012 from airplane mode
Receive message 2 nine on UE1012
Send message 3 four from UE1012 to UE0000
Receive message 3 four on UE0000
Receive duplicate message 3 four on UE0000
Receive message 3 one on UE1012
Receive duplicate message 3 one on UE1012
Receive message 3 three on UE1012
Receive message 3 two on UE1012

  • send a series of quick replies and responses back and forth***
  • No messages lost but usually not delivered until an outgoing message is sent***
Test 4, try sending while one UE disconnected, witness mixed delay and duplication:
Airplane mode and Remove from airplane mode UE0000 and UE1012
Start pcap 4
Receive message two 18 (last of prior exchange) at UE0000
Send 4 one UE0000 to UE1012
send 4 two UE0000 to UE1012
Airplane mode UE1012
Send 4 three from UE0000 to UE1012
Send 4 four from UE0000 to UE1012
Remove UE1012 from airplane mode
Receive message 4 one at UE1012
Wait 1 minute
  • At this point messages 4-two 4-three and 4-four have still not arrived ***
    wait 1 minute
    Send message 4 five to UE1012
    Send reply message 4 six to UE0000
    Receive duplicate message 4 one at UE1012
    Receive message 4 two at UE1012
    Receive duplicate message 4 two at UE1012
    Receive message 4 three at UE1012
    Receive duplicate message 4 three at UE1012
    Receive message 4 four at UE1012
    Receive message 4 five at UE1012
    stop pcap 4

Test 5, try an MSC restart, witness delay:
Start pcap5
Airplane mode both UEs
Restart BTS and MSC
Remove both UEs from airplane mode
Receive Reply 4-six on UE0000
Send message 5 one from UE1012 to UE0000
Send message 5 two from UE1012 to UE0000
Send message 5 three from UE1012 to UE0000
Send message 5 four from UE1012 to UE0000
Send message 5 five from UE1012 to UE0000
Send message 5 six from UE0000 to UE1012
Receive message 5 one at UE0000
Receive message 5 two at UE0000
Receive message 5 three at UE0000
Receive message 5 four at UE0000
Receive message 5 five at UE0000
Send message 5 seven from UE0000 to UE1012
Send message 5 eight from UE1012 to UE0000
Send message 5 nine from UE1012 to UE0000
Receive message 5 six at UE1012
Receive message 5 seven at UE1012
Stop pcap 5

Actions #3

Updated by fixeria over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to fixeria
  • Priority changed from Normal to High

Thanks for your detailed report!

I am currently investigating a very similar MT SMS delivery failure. The problem is that osmo-bsc fails to establish a SAPI=3 link with the MT SMS receiver after getting Paging Response message (on SAPI=0). This can be seen in the attached capture(s), as well as in the logs:

  334  29.382874   172.16.0.1 → 172.16.0.99  RSL 89 PAGING CoMmanD 
  337  29.573677  172.16.0.99 → 172.16.0.1   RSL 96 ESTablish INDication (DTAP) (RR) Paging Response 
  339  29.574730          187 → 185          BSSAP 152 CR (BSSMAP) Complete Layer 3 Information (DTAP) (RR) Paging Response 
  340  29.574937          187 → 185          BSSAP 152 CR (BSSMAP) Complete Layer 3 Information (DTAP) (RR) Paging Response 
  344  29.576075          185 → 187          GSM SMS 156 DT1 (DTAP) (SMS) CP-DATA (RP) RP-DATA (Network to MS) 
  347  29.576345          185 → 187          GSM SMS 156 DT1 (DTAP) (SMS) CP-DATA (RP) RP-DATA (Network to MS) 
  349  29.576824   172.16.0.1 → 172.16.0.99  RSL 77 ESTablish REQuest      // <-- (!)
  360  29.722692  172.16.0.99 → 172.16.0.1   RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 
  363  30.044439  172.16.0.99 → 172.16.0.1   RSL 77 RELease INDication     // <-- (!)
  365  30.045087          187 → 185          BSSAP 108 DT1 (BSSMAP) SAPI 'n' Reject 
  366  30.045323          187 → 185          BSSAP 108 DT1 (BSSMAP) SAPI 'n' Reject
  372  30.193222  172.16.0.99 → 172.16.0.1   RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 
  377  30.361168  172.16.0.99 → 172.16.0.1   RSL 78 CCCH LOAD INDication 
  379  30.663908  172.16.0.99 → 172.16.0.1   RSL 110 MEASurement RESult (DTAP) (RR) Measurement Report 
  382  31.092130   172.16.0.1 → 172.16.0.99  RSL 75 RF CHANnel RELease 
  384  31.093300  172.16.0.99 → 172.16.0.1   RSL 75 RF CHANnel RELease ACKnowledge

The reason why we get RELease INDication instead of ESTablish CONFirmation is to be investigated. See packets with numbers 363, 717, 1322 in your https://osmocom.org/attachments/download/4254/sms-test-1-pt1.pcapng.

Actions #4

Updated by fixeria over 3 years ago

  • % Done changed from 0 to 40

We have found a problem in LAPDm code causing MT SMS delivery failures, please see #4731.

Actions #5

Updated by fixeria over 3 years ago

  • Related to Bug #4731: LAPDm does not implement SAPI priorities between data link layers added
Actions #6

Updated by fixeria over 3 years ago

  • Status changed from In Progress to Feedback

I've just submitted a unit test reproducing the problem and potential fix:

https://gerrit.osmocom.org/c/libosmocore/+/19847 lapd_test: add a test checking SAPI0/SAPI3 prioritization
https://gerrit.osmocom.org/c/libosmocore/+/19853 lapdm: fix SAPI-0/SAPI-3 frame prioritization on DCCH

both changes have been merged. This should fix MT SMS delivery, feel free to re-try with the recent master.

Actions #7

Updated by matt9j over 3 years ago

I was able to test today with MSC version 1.6.3 which uses libosmocore 1.4.1 and I believe includes the relevant patches. Delivery was reliable, so I believe this ticket can be closed now, but I now consistently get double messages, which was only happening occasionally before. The double message phenomenon may be on my end related to integration with an ESME or otherwise, so I will investigate more before opening a separate ticket for that issue. Thanks fixeria for the quick attention and core issue fix!

I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).

Actions #8

Updated by fixeria over 3 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 40 to 100

I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).

No worries! Thanks for keeping us updated ;)

Actions #9

Updated by laforge over 3 years ago

On Sun, Sep 20, 2020 at 10:09:30PM +0000, matt9j [REDMINE] wrote:

I don't think I have permissions to edit the ticket status myself (I cannot find the option at least-- sorry if I've missed it, I'm new to Redmine!).

that should have changed now.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)