Bug #2881

closed

OsmoMSC doesn't release call if BSS-side CRCX is never responded to

Added by laforge about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: MGCP towards MGW
Target version: -
Start date: 01/26/2018
Due date:
% Done: 100%
Resolution:
Spec Reference:

Description

If the CRCX is not responded to, we get some MGCP debug log, but the call is not released:

<0007> fsm.c:182 MGW(MGW_8)[0xa567850]{ST_HALT}: Timeout of T3
<0007> msc_mgcp.c:203 MGW(MGW_8)[0xa567850]{ST_HALT}: state_chg to ST_HALT
<0007> msc_mgcp.c:204 MGW(MGW_8)[0xa567850]{ST_HALT}: Received Event 3
<0007> fsm.c:287 MGW(MGW_8)[0xa567850]{ST_HALT}: Deallocated

See MSC_Tests.TC_mo_crcx_ran_timeout


Files

20180126-osmomsc-mocall-ran-crcx-timeout.pcap (2.98 KB), pcap file showing the problem. laforge, 01/26/2018 09:35 PM
Actions #1

Updated by laforge about 6 years ago

  • Priority changed from Normal to High

To make things worse, if the MNCC handler disappears (manual termination of the test suite), OsmoMSC crashes:

<0004> mncc_sock.c:85 MNCC Socket has LOST connection
<0001> gsm_04_08.c:191 Clearing all currently active transactions!!!
==6352== Invalid read of size 8
==6352==    at 0x128B6A: msc_mgcp_call_release (msc_mgcp.c:1052)
==6352==    by 0x11ED50: _gsm48_cc_trans_free (gsm_04_08.c:1419)
==6352==    by 0x12BF94: trans_free (transaction.c:123)
==6352==    by 0x11CFEA: gsm0408_clear_all_trans (gsm_04_08.c:196)
==6352==    by 0x125A07: mncc_sock_close (mncc_sock.c:95)
==6352==    by 0x125B1E: mncc_sock_read (mncc_sock.c:140)
==6352==    by 0x125B1E: mncc_sock_cb (mncc_sock.c:198)
==6352==    by 0x56D0950: osmo_fd_disp_fds (select.c:216)
==6352==    by 0x56D0950: osmo_select_main (select.c:256)
==6352==    by 0x11371B: main (msc_main.c:546)
==6352==  Address 0xa567740 is 96 bytes inside a block of size 200 free'd
==6352==    at 0x4C2DDBB: free (vg_replace_malloc.c:530)
==6352==    by 0x505BE82: _talloc_free (in /usr/lib/x86_64-linux-gnu/libtalloc.so.2.1.10)
==6352==    by 0x56D3C8E: _osmo_fsm_inst_dispatch (fsm.c:450)
==6352==    by 0x12830B: fsm_timeout_cb (msc_mgcp.c:204)
==6352==    by 0x56D4458: fsm_tmr_cb (fsm.c:185)
==6352==    by 0x56D0305: osmo_timers_update (timer.c:257)
==6352==    by 0x56D0904: osmo_select_main (select.c:253)
==6352==    by 0x11371B: main (msc_main.c:546)
==6352==  Block was alloc'd at
==6352==    at 0x4C2CB8F: malloc (vg_replace_malloc.c:299)
==6352==    by 0x505E150: _talloc_zero (in /usr/lib/x86_64-linux-gnu/libtalloc.so.2.1.10)
==6352==    by 0x128448: msc_mgcp_call_assignment (msc_mgcp.c:902)
==6352==    by 0x11B59F: gsm48_cc_tx_call_proc_and_assign (gsm_04_08.c:1889)
==6352==    by 0x11F1C6: mncc_tx_to_cc (gsm_04_08.c:3145)
==6352==    by 0x125C9F: mncc_sock_read (mncc_sock.c:130)
==6352==    by 0x125C9F: mncc_sock_cb (mncc_sock.c:198)
==6352==    by 0x56D0950: osmo_fd_disp_fds (select.c:216)
==6352==    by 0x56D0950: osmo_select_main (select.c:256)
==6352==    by 0x11371B: main (msc_main.c:546)
Actions #2

Updated by dexter about 6 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 80

I managed to pinpoint the problem: on an MGW timeout, no action was taken to tell the upper layers that something went wrong, so the MGW could time out while gsm_04_08 continued to assume everything was all right. I have restructured the error handling a bit so that this case is now covered as well. A timeout now generates a release request, which is then acknowledged by the BSC? MS? and will eventually cause a clear command to be sent.
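The idea of reporting an MGW timeout to the upper layer can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `struct call_trans`, `struct mgw_ctx`, `cc_request_release()` and `mgw_on_timeout()` are hypothetical names that do not exist in osmo-msc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the osmo-msc types. */
struct call_trans {
	bool release_requested; /* set once the CC layer was asked to release */
};

struct mgw_ctx {
	struct call_trans *trans; /* transaction that owns this MGW context, may be NULL */
	bool halted;              /* true once the MGW state machine gave up */
};

/* Hypothetical upper-layer hook: ask the CC layer to start releasing the call. */
void cc_request_release(struct call_trans *trans)
{
	assert(trans != NULL);
	trans->release_requested = true;
}

/* Timeout handler: halt the MGW side and report the failure upwards,
 * so the call does not linger after an unanswered CRCX. */
void mgw_on_timeout(struct mgw_ctx *ctx)
{
	assert(ctx != NULL);
	ctx->halted = true;
	if (ctx->trans != NULL)
		cc_request_release(ctx->trans);
}
```

The point of the sketch is only that the timeout path has two duties: stopping the MGW state machine and telling the call control layer, so that a release is actually started.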

Unfortunately the TTCN3 testcase does not handle the release request, which leads us to the next bug. The problem here is that when the CC times out on the MSC side, the trans is freed, but no action concerning clearing the connection is taken. This is now also fixed. We now clear when nobody answers the release request.

I think the TTCN3 test should deal with the release request, so I will add that now. I am a bit confused about


template PDU_ML3_NW_MS tr_ML3_MT_CC_RELEASE(integer tid) := {
    discriminator := '0011'B,
    tiOrSkip := {
        transactionId := {
            tio := int2bit(tid, 3),
            tiFlag := ?,
            tIExtension := omit
        }
    },
    msgs := {
        cc := {
            release_NW_MS := {
                messageType := '101101'B,
                nsd := '00'B,
                cause := ?,
                secondCause := *,
                facility := *,
                user_user := *
            }
        }
    }
}

in L3_Templates.ttcn

There the cause code is specified as "?", which means it must not be omitted, but the release request I see from the MSC lacks a cause code. According to Table 9.68a of 3GPP TS 24.008, the cause code is optional. However, in 9.3.18.2.1 I can read "This information element shall be included if this message is used to initiate call clearing.", so I think that in this special case it is indeed not optional, since we use the message to initiate call clearing. That means L3_Templates.ttcn is correct, and we need to fix this in the MSC.
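The rule quoted from 9.3.18.2.1 (cause optional in general, required when the RELEASE initiates call clearing) could be expressed as a small check like the one below. It is a sketch only: `struct cc_release` and `cc_release_cause_ok()` are hypothetical names for illustration and are not part of osmo-msc or libosmocore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a CC RELEASE message. */
struct cc_release {
	bool has_cause;      /* is the Cause IE present? */
	unsigned char cause; /* cause value, only meaningful if has_cause is true */
};

/* Returns true if the message satisfies the rule quoted above:
 * the Cause IE may be absent, unless this RELEASE initiates call clearing. */
bool cc_release_cause_ok(const struct cc_release *msg, bool initiates_clearing)
{
	assert(msg != NULL);
	return !initiates_clearing || msg->has_cause;
}
```

With such a check, a RELEASE without a cause is accepted when it merely answers an ongoing clearing, and rejected when it starts one, which is the case the template's "?" covers.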

Actions #3

Updated by dexter about 6 years ago

  • % Done changed from 80 to 100

The problem is now fixed, see patches:

https://gerrit.osmocom.org/7277 msc_mgcp: fix mgw timeout handling
https://gerrit.osmocom.org/7278 gsm_04_08: clear SCCP connection when release is not acknowledged

Also the TTCN3 test had to be fixed since the MSC first sends a release request before it clears the SCCP connection:

https://gerrit.osmocom.org/7280 MSC_Tests: Respond to BSSMAP rlease

Actions #4

Updated by laforge about 6 years ago

On Tue, Mar 13, 2018 at 05:37:42PM +0000, dexter [REDMINE] wrote:

> There the cause code is specified as "?", which means it must not be omitted but the release request I see from the MSC lacks a cause code. By 3GPP TS 24.008, Table 9.68a/3GPP TS 24.008, the cause code is optional. In 9.3.18.2.1 I can read "This information element shall be included if this message is used to initiate call clearing." so I think that in that special case it is indeed not optional, since we use it to initiate a call clearing. So that means that L3_Templates.ttcn is correct, and we need to fix this in the MSC.

I agree, as per your quote it is not optional in this situation.

Actions #5

Updated by dexter about 6 years ago

I agreed with neels that he should have a look at the problem with the missing connection clearing. I have now abandoned https://gerrit.osmocom.org/7278, since I found out that the error situation somehow prevents the usage counter from reaching 0. That is why the connection is not cleared. To make debugging easier, I created a TTCN3 test that provokes the issue: https://gerrit.osmocom.org/#/c/7319/
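Why a usage counter that never reaches 0 keeps the connection alive can be illustrated with a generic reference-counting sketch. This is an assumption-laden model, not the osmo-msc implementation: `struct conn`, `conn_get()` and `conn_put()` are hypothetical names, and the `cleared` flag merely stands in for sending the clear command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection with a usage counter. */
struct conn {
	int use_count; /* number of users currently holding the connection */
	bool cleared;  /* stands in for "clear command was sent" */
};

/* Register one more user of the connection. */
void conn_get(struct conn *c)
{
	assert(c != NULL);
	c->use_count++;
}

/* Drop one user; the connection is cleared only when the last user is gone. */
void conn_put(struct conn *c)
{
	assert(c != NULL && c->use_count > 0);
	if (--c->use_count == 0)
		c->cleared = true;
}
```

If an error path frees a transaction without the matching `conn_put()`, the counter stays above 0 and the clearing never happens, which matches the symptom described above.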

Unfortunately, https://gerrit.osmocom.org/#/c/7277/ introduced a use-after-free, which is now fixed by https://gerrit.osmocom.org/7325. (I have a TTCN3 test for this which is almost done, but I am having some problems with it. I will push it as soon as it is ready.)
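A common way to avoid this class of use-after-free is to reset the owner's pointer whenever the context is freed, and to check it before use. The sketch below shows that general pattern only; it is not the actual fix from the patch, and `struct mgcp_ctx`, `struct call_trans`, `mgcp_ctx_attach()`, `mgcp_ctx_detach()` and `call_release()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical MGCP context owned by a transaction. */
struct mgcp_ctx {
	int endpoint;
};

struct call_trans {
	struct mgcp_ctx *mgcp; /* NULL when no MGCP context exists */
};

/* Allocate a context and link it to the transaction. */
bool mgcp_ctx_attach(struct call_trans *trans, int endpoint)
{
	assert(trans != NULL && trans->mgcp == NULL);
	struct mgcp_ctx *ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL)
		return false;
	ctx->endpoint = endpoint;
	trans->mgcp = ctx;
	return true;
}

/* Free the context (e.g. after a timeout) and reset the owner's pointer,
 * so no stale pointer is left behind. */
void mgcp_ctx_detach(struct call_trans *trans)
{
	assert(trans != NULL);
	free(trans->mgcp);
	trans->mgcp = NULL;
}

/* Release path: only touch the context if it still exists. */
bool call_release(struct call_trans *trans)
{
	assert(trans != NULL);
	if (trans->mgcp == NULL)
		return false; /* already gone, nothing to release */
	mgcp_ctx_detach(trans);
	return true;
}
```

In this model, a release that arrives after the timeout already freed the context simply finds a NULL pointer instead of reading freed memory.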

Actions #6

Updated by dexter about 6 years ago

  • Assignee changed from dexter to neels
Actions #7

Updated by dexter about 6 years ago

There is now a TTCN3 testcase that provokes the use-after-free situation by not responding to DLCX at the end of the call: https://gerrit.osmocom.org/#/c/7390/ I think we should add some more of those traps. We should also see what happens when other MGCP commands are not answered.

Actions #8

Updated by laforge about 6 years ago

On Mon, Mar 19, 2018 at 03:15:58PM +0000, dexter [REDMINE] wrote:

> There is now a TTCN3 testcase that provokes the use after free situation by not responding to DLCX at the end of the call. https://gerrit.osmocom.org/#/c/7390/

yay!

> I think we should add some more of those traps, we should also see what happens when other MGCP commands are not answered.

sure. but let's try to fix the currently known bug[s] first? After all,
it's more important to fix one crash than to know about ten :)

Actions #9

Updated by dexter almost 6 years ago

  • Status changed from In Progress to Resolved

I have re-tested the crash that occurred when the MNCC socket vanishes. When I test manually with an external MNCC handler, it looks OK to me. I can stop osmo-sip-connector at any time and OsmoMSC does not crash. When I stop it while I am in a call, the call gets shut down. I also ran the TTCN3 test and terminated the test suite at random times. This did not lead to a crash either. From that perspective I would say that we can set this to resolved.

Actions #10

Updated by neels almost 6 years ago

The test mentioned above is MSC_Tests.TC_lu_and_mt_call_no_dlcx_resp.
It passes in current master
and still passes after the MSC's FSM refactoring in https://gerrit.osmocom.org/#/q/status:open+project:osmo-msc+branch:master+topic:fsm_refactor
