Project

General

Profile

Bug #2885

OsmoMSC crashes on MNCC disconnect

Added by laforge 9 months ago. Updated 17 days ago.

Status:
Feedback
Priority:
Urgent
Assignee:
Category:
MGCP towards MGW
Target version:
-
Start date:
01/27/2018
Due date:
% Done:

0%

Resolution:

Description

<0004> gsm_04_08.c:1359 transmit message MNCC_CALL_CONF_IND
<0007> msc_mgcp.c:280 MGW(MGW_0)[0xa9bb0c0]{ST_CRCX_CN}: CRCX/RAN: response yields error: 542 FORCED_FAIL
<0007> msc_mgcp.c:281 MGW(MGW_0)[0xa9bb0c0]{ST_CRCX_CN}: operation failed on MGW -- graceful shutdown...
<0007> msc_mgcp.c:730 MGW(MGW_0)[0xa9bb0c0]{ST_HALT}: DLCX: response yields error: 250 OK
<0007> msc_mgcp.c:731 MGW(MGW_0)[0xa9bb0c0]{ST_HALT}: operation failed on MGW -- graceful shutdown...
<0007> msc_mgcp.c:157 MGW(MGW_0)[0xa9bb0c0]{ST_HALT}: transition to state ST_CALL not permitted!
<0004> gsm_04_08.c:1359 transmit message MNCC_DISC_IND
<0012> input/ipa.c:67 connection closed with server
<0004> mncc_sock.c:85 MNCC Socket has LOST connection
<0001> gsm_04_08.c:191 Clearing all currently active transactions!!!
==17608== Invalid read of size 8
==17608==    at 0x128B6A: msc_mgcp_call_release (msc_mgcp.c:1052)
==17608==    by 0x11ED50: _gsm48_cc_trans_free (gsm_04_08.c:1419)
==17608==    by 0x12BF94: trans_free (transaction.c:123)
==17608==    by 0x11CFEA: gsm0408_clear_all_trans (gsm_04_08.c:196)
==17608==    by 0x125A07: mncc_sock_close (mncc_sock.c:95)
==17608==    by 0x125B1E: mncc_sock_read (mncc_sock.c:140)
==17608==    by 0x125B1E: mncc_sock_cb (mncc_sock.c:198)
==17608==    by 0x56D0950: osmo_fd_disp_fds (select.c:216)
==17608==    by 0x56D0950: osmo_select_main (select.c:256)
==17608==    by 0x11371B: main (msc_main.c:546)
==17608==  Address 0xaaa3810 is 96 bytes inside a block of size 200 free'd
==17608==    at 0x4C2DDBB: free (vg_replace_malloc.c:530)
==17608==    by 0x505BE82: _talloc_free (in /usr/lib/x86_64-linux-gnu/libtalloc.so.2.1.10)
==17608==    by 0x56D3C8E: _osmo_fsm_inst_dispatch (fsm.c:450)
==17608==    by 0x12830B: fsm_timeout_cb (msc_mgcp.c:204)
==17608==    by 0x56D4458: fsm_tmr_cb (fsm.c:185)
==17608==    by 0x56D0305: osmo_timers_update (timer.c:257)
==17608==    by 0x56D0904: osmo_select_main (select.c:253)
==17608==    by 0x11371B: main (msc_main.c:546)
==17608==  Block was alloc'd at
==17608==    at 0x4C2CB8F: malloc (vg_replace_malloc.c:299)
==17608==    by 0x505E150: _talloc_zero (in /usr/lib/x86_64-linux-gnu/libtalloc.so.2.1.10)
==17608==    by 0x128448: msc_mgcp_call_assignment (msc_mgcp.c:902)
==17608==    by 0x11C578: gsm48_cc_rx_call_conf (gsm_04_08.c:1847)
==17608==    by 0x11FE8C: gsm0408_rcv_cc (gsm_04_08.c:3269)
==17608==    by 0x11FE8C: gsm0408_dispatch (gsm_04_08.c:3380)
==17608==    by 0x12D05C: msc_dtap (osmo_msc.c:108)
==17608==    by 0x116BB2: rx_dtap (a_iface_bssap.c:683)
==17608==    by 0x116BB2: a_sccp_rx_dt (a_iface_bssap.c:710)
==17608==    by 0x114367: sccp_sap_up (a_iface.c:529)
==17608==    by 0x56D3C8E: _osmo_fsm_inst_dispatch (fsm.c:450)
==17608==    by 0x5D5D9D4: sccp_scoc_rx_from_scrc (sccp_scoc.c:1581)
==17608==    by 0x5D5B6CA: scrc_rx_mtp_xfer_ind_xua (sccp_scrc.c:449)
==17608==    by 0x5D5E5A4: mtp_user_prim_cb (sccp_user.c:176)
==17608== 

So it seems that upon MNCC disconnect, it tries to free some MGCP state again, which was already free'd due to an earlier MGCP failure.

History

#1 Updated by laforge 8 months ago

  • Assignee changed from dexter to daniel

#2 Updated by laforge 7 months ago

  • Assignee changed from daniel to sysmocom

#3 Updated by laforge 7 months ago

  • Assignee changed from sysmocom to stsp

#4 Updated by neels 7 months ago

(if this touches MSC's subscriber connection allocation/cleanup scheme at all, rather wait for the pending refactoring patch to be merged; if in doubt, ask me)

#5 Updated by laforge 2 months ago

  • Priority changed from High to Urgent

reclassifying as urgent. If we have a reproducible crash in osmo-mcs for 6 months, it's a really serious issue that needs to be investigate now.

#6 Updated by stsp 2 months ago

Unfortunately, there is not enough context provided in this issue to allow anyone to easily reproduce the problem.
At least not enough context for someone who has never even used this part of the software stack before.

A simple test where osmo-sip-connector is started and stopped on the mncc socket does not trigger the crash.
The pcap file suggests that a phone call is required. What is the best way to simulate this locally?
Will I need a full-blown osmo-sip-connector + asterisk setup and make an actual call?

Do we already know something about what needs to be done to reproduce the related "earlier MGCP failure" mentioned in the issue description?

#7 Updated by stsp 2 months ago

  • Status changed from New to In Progress

#8 Updated by fixeria 2 months ago

What is the best way to simulate this locally?

I think it makes sense to try to disconnect MNCC from OsmoMSC during an active call,
so it will trigger cleaning up process of the active transactions...

#9 Updated by fixeria 2 months ago

Will I need a full-blown osmo-sip-connector + asterisk setup and make an actual call?

Not exclusively Asterisk, but yes, you need some PBX. There are also Freeswitch
(I can help with configuration), and LCR (Linux Call Router)...

#10 Updated by laforge 2 months ago

On Tue, Aug 07, 2018 at 10:47:36AM +0000, fixeria [REDMINE] wrote:

Will I need a full-blown osmo-sip-connector + asterisk setup and make an actual call?

Not exclusively Asterisk, but yes, you need some PBX. There are also Freeswitch
(I can help with configuration), and LCR (Linux Call Router)...

we also have mncc-python examples which will establish a variety of calls from the MNCC
side without any PBX:

https://git.osmocom.org/mncc-python/

#11 Updated by stsp 2 months ago

laforge wrote:

we also have mncc-python examples which will establish a variety of calls from the MNCC
side without any PBX:

https://git.osmocom.org/mncc-python/

That is neat. I am trying to get this to work, however the call setup seems to fail because the subscribers aren't known to the VLR:

DCC <0001> gsm_04_08_cc.c:1875 (bts - trx - ts - ti -- sub 1111) Received 'MNCC_SETUP_REQ' from MNCC with unknown subscriber 1111

The subscribers are present in the HLR database. Perhaps some steps are missing in this simulation (e.g. I don't see where a location update would be sent)?

In any case, this isn't the only bug assigned to me which is stalling because I don't have a working physical setup to test with. Perhaps I should finally invest some time into setting that up...

#12 Updated by laforge 2 months ago

On Thu, Aug 09, 2018 at 11:41:16AM +0000, stsp [REDMINE] wrote:

That is neat. I am trying to get this to work, however the call setup seems to fail because the subscribers aren't known to the VLR:

the assumption when writing the code is that you use this with real
phones which of course perform a LU before you then use mncc-python to
create a number of voice calls between the phones.

#13 Updated by stsp 2 months ago

OK I will look into getting a more elaborate setup that can reproduce this problem.

Do you by chance remember in which environment the original error was found?
Would knowing more about that environment help me in some way?

#14 Updated by stsp 22 days ago

I have managed to get a setup where the MSC crashes reproducably with just one phone on the network dialing to another subscriber that is not registered on the network. (Edit: Thanks Neels for help with getting up and running, again!)

This crash looks similar, but I'm not yet sure if it is the same problem.
Regardless, I'll try to fix this one now.

20180927170812288 DLSCCP <001e> sccp_scoc.c:1548 Received CO:CODT for local reference 1
20180927170812289 DLSCCP <001e> sccp_scoc.c:1581 SCCP-SCOC(1)[0x612000011aa0]{ACTIVE}: Received Event RCOC-DT1.ind
20180927170812289 DLSCCP <001e> sccp_user.c:156 Delivering N-DATA.indication to SCCP User 'OsmoMSC-A'
20180927170812289 DBSSAP <0010> a_iface.c:564 N-DATA.ind(1, 01 00 02 03 aa )
20180927170812289 DMSC <0006> a_iface_bssap.c:80 Looking for A subscriber: conn_id 1
20180927170812289 DBSSAP <0010> a_iface_bssap.c:88 (subscr MSISDN:12345, conn_id 1) Found A subscriber for conn_id 1
20180927170812289 DBSSAP <0010> a_iface_bssap.c:674 (subscr MSISDN:12345, conn_id 1) Rx DTAP 01 00 02 03 aa
20180927170812289 DRLL <0000> gsm_04_08.c:1439 Dispatching 04.08 message GSM48_MT_CC_RELEASE_COMPL (0x3:0x2a)
20180927170812289 DCC <0001> gsm_04_08_cc.c:176 stopping pending timer T308
20180927170812289 DMNCC <0004> gsm_04_08_cc.c:188 transmit message MNCC_REL_CNF
20180927170812289 DCC <0001> gsm_04_08_cc.c:211 Sending 'MNCC_REL_CNF' to MNCC.
20180927170812289 DMNCC <0004> mncc_builtin.c:311 (call 80000001) Received message MNCC_REL_CNF
20180927170812290 DMNCC <0004> mncc_builtin.c:51 (call 80000001) Call removed.
20180927170812290 DMGCP <0007> msc_mgcp.c:1169 MGW(MGW_8)[0x612000014320]{ST_CRCX_CN}: Received Event EV_TEARDOWN
20180927170812290 DMGCP <0007> msc_mgcp.c:408 MGW(MGW_8)[0x612000014320]{ST_CRCX_CN}: unexpected connection teardown -- graceful shutdown...
20180927170812290 DMGCP <0007> msc_mgcp.c:205 MGW(MGW_8)[0x612000014320]{ST_CRCX_CN}: state_chg to ST_HALT
20180927170812290 DMGCP <0007> msc_mgcp.c:206 MGW(MGW_8)[0x612000014320]{ST_HALT}: Received Event EV_TEARDOWN_ERROR
20180927170812290 DLMGCP <0021> mgcp_client.c:979 Canceled transaction 1
20180927170812290 DMGCP <0007> fsm.c:381 MGW(MGW_8)[0x612000014320]{ST_HALT}: Deallocated
=================================================================
==27033==ERROR: AddressSanitizer: heap-use-after-free on address 0x61700000616c at pc 0x55a07a9abf22 bp 0x7ffce189cb40 sp 0x7ffce189cb30
READ of size 1 at 0x61700000616c thread T0
    #0 0x55a07a9abf21 in _handle_error /home/stsp/osmo/osmo-msc/src/libmsc/msc_mgcp.c:213
    #1 0x55a07a9aefe5 in fsm_crcx_cn_cb /home/stsp/osmo/osmo-msc/src/libmsc/msc_mgcp.c:408
    #2 0x7f76f0358d13 in _osmo_fsm_inst_dispatch /home/stsp/osmo/libosmocore/src/fsm.c:580
    #3 0x55a07a9b7f80 in msc_mgcp_call_release /home/stsp/osmo/osmo-msc/src/libmsc/msc_mgcp.c:1169
    #4 0x55a07a984c96 in _gsm48_cc_trans_free /home/stsp/osmo/osmo-msc/src/libmsc/gsm_04_08_cc.c:248
    #5 0x55a07a9bd75d in trans_free /home/stsp/osmo/osmo-msc/src/libmsc/transaction.c:125
    #6 0x55a07a98c876 in gsm48_cc_rx_release_compl /home/stsp/osmo/osmo-msc/src/libmsc/gsm_04_08_cc.c:1157
    #7 0x55a07a993e73 in gsm0408_rcv_cc /home/stsp/osmo/osmo-msc/src/libmsc/gsm_04_08_cc.c:2135
    #8 0x55a07a980d6e in gsm0408_dispatch /home/stsp/osmo/osmo-msc/src/libmsc/gsm_04_08.c:1467
    #9 0x55a07a9bee06 in msc_dtap /home/stsp/osmo/osmo-msc/src/libmsc/osmo_msc.c:125
    #10 0x55a07a963312 in rx_dtap /home/stsp/osmo/osmo-msc/src/libmsc/a_iface_bssap.c:681
    #11 0x55a07a963589 in a_sccp_rx_dt /home/stsp/osmo/osmo-msc/src/libmsc/a_iface_bssap.c:703
    #12 0x55a07a95c550 in sccp_sap_up /home/stsp/osmo/osmo-msc/src/libmsc/a_iface.c:565
    #13 0x7f76ef7d6d10 in sccp_user_prim_up /home/stsp/osmo/libosmo-sccp/src/sccp_user.c:157
    #14 0x7f76ef7d0683 in scu_gen_encode_and_send /home/stsp/osmo/libosmo-sccp/src/sccp_scoc.c:704
    #15 0x7f76ef7d1ba7 in scoc_fsm_active /home/stsp/osmo/libosmo-sccp/src/sccp_scoc.c:1015
    #16 0x7f76f0358d13 in _osmo_fsm_inst_dispatch /home/stsp/osmo/libosmocore/src/fsm.c:580
    #17 0x7f76ef7d46ae in sccp_scoc_rx_from_scrc /home/stsp/osmo/libosmo-sccp/src/sccp_scoc.c:1581
    #18 0x7f76ef7c7a69 in scrc_rx_mtp_xfer_ind_xua /home/stsp/osmo/libosmo-sccp/src/sccp_scrc.c:457
    #19 0x7f76ef7d703b in mtp_user_prim_cb /home/stsp/osmo/libosmo-sccp/src/sccp_user.c:176
    #20 0x7f76ef7f2122 in deliver_to_mtp_user /home/stsp/osmo/libosmo-sccp/src/osmo_ss7_hmrt.c:94
    #21 0x7f76ef7f24b5 in hmdt_message_for_distribution /home/stsp/osmo/libosmo-sccp/src/osmo_ss7_hmrt.c:133
    #22 0x7f76ef7f34a5 in m3ua_hmdc_rx_from_l2 /home/stsp/osmo/libosmo-sccp/src/osmo_ss7_hmrt.c:275
    #23 0x7f76ef7b3c88 in m3ua_rx_xfer /home/stsp/osmo/libosmo-sccp/src/m3ua.c:586
    #24 0x7f76ef7b4900 in m3ua_rx_msg /home/stsp/osmo/libosmo-sccp/src/m3ua.c:739
    #25 0x7f76ef7ee95a in xua_cli_read_cb /home/stsp/osmo/libosmo-sccp/src/osmo_ss7.c:1590
    #26 0x7f76ed17ac26 in osmo_stream_cli_read /home/stsp/osmo/libosmo-netif/src/stream.c:192
    #27 0x7f76ed17bb2b in osmo_stream_cli_fd_cb /home/stsp/osmo/libosmo-netif/src/stream.c:276
    #28 0x7f76f0341573 in osmo_fd_disp_fds /home/stsp/osmo/libosmocore/src/select.c:217
    #29 0x7f76f0341874 in osmo_select_main /home/stsp/osmo/libosmocore/src/select.c:257
    #30 0x55a07a956f5c in main /home/stsp/osmo/osmo-msc/src/osmo-msc/msc_main.c:701
    #31 0x7f76edfc2b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #32 0x55a07a9554d9 in _start (/home/stsp/osmo/prefix/bin/osmo-msc+0xfc4d9)

0x61700000616c is located 620 bytes inside of 712-byte region [0x617000005f00,0x6170000061c8)
freed by thread T0 here:
    #0 0x7f76f13ac7b8 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xde7b8)
    #1 0x7f76f0d9ca52 in _talloc_free (/usr/lib/x86_64-linux-gnu/libtalloc.so.2+0x3a52)

previously allocated by thread T0 here:
    #0 0x7f76f13acb50 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdeb50)
    #1 0x7f76f0d9ed20 in _talloc_zero (/usr/lib/x86_64-linux-gnu/libtalloc.so.2+0x5d20)

SUMMARY: AddressSanitizer: heap-use-after-free /home/stsp/osmo/osmo-msc/src/libmsc/msc_mgcp.c:213 in _handle_error
Shadow bytes around the buggy address:
  0x0c2e7fff8bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2e7fff8be0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c2e7fff8bf0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c2e7fff8c00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c2e7fff8c10: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c2e7fff8c20: fd fd fd fd fd fd fd fd fd fd fd fd fd[fd]fd fd
  0x0c2e7fff8c30: fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa
  0x0c2e7fff8c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2e7fff8c50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c2e7fff8c60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c2e7fff8c70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==27033==ABORTING
s

#15 Updated by stsp 21 days ago

This crash happens because handle_error in msc_mgcp.c seems to disregard
lifetime guarantees of the context.

Can someone fill me in on what's expected here?
Should handle_error set the free_ctx flag to false before triggering a transition to ST_HALT?

diff --git a/src/libmsc/msc_mgcp.c b/src/libmsc/msc_mgcp.c
index e58b24903..1ef844757 100644
--- a/src/libmsc/msc_mgcp.c
+++ b/src/libmsc/msc_mgcp.c
@@ -203,14 +203,14 @@ static void _handle_error(struct mgcp_ctx *mgcp_ctx, enum msc_mgcp_cause_code ca
          * freed yet, we stil require the higher layers to call
          * msc_mgcp_call_release() */
         osmo_fsm_inst_state_chg(fi, ST_HALT, 0, 0);
-        osmo_fsm_inst_dispatch(fi, EV_TEARDOWN_ERROR, mgcp_ctx);
+        osmo_fsm_inst_dispatch(fi, EV_TEARDOWN_ERROR, mgcp_ctx); // <-- mcgp_ctx freed here
     }

     /* Request the higher layers (gsm_04_08.c) to release the call. If the
      * problem occured after msc_mgcp_call_release() was calls, remain
      * silent because we already got informed and the higher layers might
      * already freed their context information (trans). */
-    if (!mgcp_ctx->free_ctx) {
+    if (!mgcp_ctx->free_ctx) { // <-- mcgp_ctx dereferenced here
         mncc_set_cause(&mncc, GSM48_CAUSE_LOC_TRANS_NET,
                    GSM48_CC_CAUSE_RESOURCE_UNAVAIL);
         mncc_tx_to_cc(mgcp_ctx->trans->net, MNCC_REL_REQ, &mncc);

#16 Updated by stsp 21 days ago

A simple solution is to move this block of code upwards:
https://gerrit.osmocom.org/#/c/osmo-msc/+/11146

#17 Updated by stsp 21 days ago

Regarding the original crash which prompted this issue:

I have not yet managed to reproduce the problem.

The original problem occured when the MGW returned an error code, and I have not yet tried to reproduce this exact scenario. I have tried two similar cases, however:

  • Set up a call between two phones with mncc_test.py; This works, except that voice streams don't seem to work (voice works fine if I use internal MNCC handler). I wait until both phones are ringing, then Ctrl-C the mncc_test.py process. osmo-msc reports that the MNCC socket was closed, and cleanly terminates related transactions. Cleanup seems clean and there is no crash See attached log file osmo-msc-log-osmo-mgw-running.txt.
  • The same as above, except osmo-mgw is not running so we get a 'connection refused' error when trying to talk to osmo-mgw. Again, cleanup seems clean and there is no crash. See attached log file osmo-msc-log-osmo-mgw-not-running.txt.

osmo-msc doesn't crash in either case.

#18 Updated by stsp 17 days ago

  • Status changed from In Progress to Feedback

Setting this to 'feedback' for now because it seems that cleanup of transactions after broken MNCC connections is working OK now.
It's still unclear to me how to reproduce the original issue exactly but I suspect it is now fixed.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)