Bug #3713
closedosmo-bsc abort()s in ttcn3-bsc-sccplite TC_ho_int testcase
100%
Description
I can not reproduce this if TC_ho_int is the first test I run. Running the whole suite with docker does cause this issue so it seems some lingering state triggers this.
The core file shows the following:
(gdb) bt full #0 __GI_abort () at abort.c:125 act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0} sigs = {__val = {32, 0 <repeats 15 times>}} #1 0x00007f8033abd020 in osmo_panic () from /usr/lib/x86_64-linux-gnu/libosmocore.so.11 No symbol table info available. #2 0x0000000000432dca in conn_get_bts (conn=<optimized out>) at ../../include/osmocom/bsc/gsm_data.h:1242 No locals. #3 ho_or_as (vty=0xf0d2d0, argv=0xf0dcb0, argc=6) at bsc_vty.c:1668 conn = 0xf0dcb0 new_bts = 0x7fff8a84a590 bts_nr = 0 trx_nr = 0 ss_nr = 856498279 bts_nr_new = 514 action = 0x6 <error: Cannot access memory at address 0x6> #4 0x00007f80343567c5 in ?? () from /usr/lib/x86_64-linux-gnu/libosmovty.so.4 No symbol table info available. #5 0x00007f803435b429 in vty_read () from /usr/lib/x86_64-linux-gnu/libosmovty.so.4 No symbol table info available. #6 0x00007f803435d4d9 in ?? () from /usr/lib/x86_64-linux-gnu/libosmovty.so.4 No symbol table info available. #7 0x00007f8033ab38f0 in osmo_select_main () from /usr/lib/x86_64-linux-gnu/libosmocore.so.11 No symbol table info available.
The code in ho_or_as iterates through the subscr_conns and just assumes there is an lchan present for all of them. I guess this is no longer the case and we should prepend a conn->lchan && to the if statement?
Files
Updated by neels over 5 years ago
- Status changed from New to Feedback
- Assignee changed from neels to daniel
hmm, the conclusion seems right, but let's first try to find out which conn it is that has no lchan, and whether that makes sense at all.
The fact that this testcase alone cannot trigger the failure indicates that some previous test leaves a conn lingering behind.
On branch neels/hack I pushed a patch that might help to find out:
commit e92ce3f0c6a03084c9f1037204c94c9a44c37bdd (origin/neels/hack, neels/hack) Author: Neels Hofmeyr <neels@hofmeyr.de> Date: Tue Nov 27 18:19:33 2018 +0100 NOT FOR MERGE: try to find out why there is a conn without an lchan in ttcn3 tests, we currently face a crash where the vty 'handover any' command assumes all conns have an lchan. Don't access a NULL lchan. But first we want to find out why there is a NULL lchan, so log a visible error. Change-Id: If5909b3890cf3f6327b271e53e458a417cebf73f diff --git a/src/osmo-bsc/bsc_vty.c b/src/osmo-bsc/bsc_vty.c index 983dcb9b3..947ba67df 100644 --- a/src/osmo-bsc/bsc_vty.c +++ b/src/osmo-bsc/bsc_vty.c @@ -1665,6 +1665,11 @@ static int ho_or_as(struct vty *vty, const char *argv[], int argc) /* Find the connection/lchan that we want to handover */ llist_for_each_entry(conn, &net->subscr_conns, entry) { + if (!conn->lchan) { + LOGP(DHO, LOGL_ERROR, "XXXXX found a conn without an lchan: %s %s\n", + osmo_fsm_inst_name(conn->fi), bsc_subscr_name(conn->bsub)); + continue; + } if (conn_get_bts(conn)->nr == bts_nr && conn->lchan->ts->trx->nr == trx_nr && conn->lchan->ts->nr == ts_nr && conn->lchan->nr == ss_nr) {
daniel can you try to reproduce the problem with this patch applied? Then we might see an error log with XXXXX telling us the FSM instance name of a conn, and hence we might be able to associate that with some earlier logging from osmo-bsc to find out why that conn is still around.
Updated by daniel over 5 years ago
- File osmo-bsc.log osmo-bsc.log added
- File BSC_Tests.TC_paging_resp_unsol.merged BSC_Tests.TC_paging_resp_unsol.merged added
- File BSC_Tests.TC_paging_resp_unsol.pcap BSC_Tests.TC_paging_resp_unsol.pcap added
- Assignee changed from daniel to neels
With your patch the only failing test is currently TC_paging_resp_unsol:
f_expect_chan_rel(): expect_rll_rel_req=true got_rll_rel_req=false
I can see the log message you added:
Wed Nov 28 14:37:47 2018 DHO <0008> bsc_vty.c:1670 XXXXX found a conn without an lchan: SUBSCR_CONN[0x21d7010] unknown
Lines 32073 of osmo-bsc.log show where this came from:
32073 Wed Nov 28 14:36:11 2018 DRSL <0003> osmo_bsc_main.c:283 bootstrapping RSL for BTS/TRX (0/0) on ARFCN 871 using MCC-MNC 001-01 LAC=1 CID=0 32073 BSIC=10 32074 Wed Nov 28 14:36:12 2018 DRSL <0003> abis_rsl.c:1357 (bts=0) CHAN RQD: reason: answer to paging (ra=0x23, neci=0x01, chreq_reason=0x01) 32075 Wed Nov 28 14:36:12 2018 DMSC <0007> fsm.c:299 SUBSCR_CONN[0x21d7010]{INIT}: Allocated 32076 Wed Nov 28 14:36:12 2018 DLCLS <000f> fsm.c:299 LCLS[0x21d7140]{NO_LCLS}: Allocated 32077 Wed Nov 28 14:36:12 2018 DLCLS <000f> fsm.c:329 LCLS[0x21d7140]{NO_LCLS}: is child of SUBSCR_CONN[0x21d7010] 32078 Wed Nov 28 14:36:12 2018 DMSC <0007> gsm_08_08.c:328 Got paged but no subscriber found. 32079 Wed Nov 28 14:36:12 2018 DMSC <0007> gsm_08_08.c:363 Failed to find a MSC for a connection. 32080 Wed Nov 28 14:36:12 2018 DMSC <0007> lchan_fsm.c:1390 SUBSCR_CONN[0x21d7010]{INIT}: lchan lchan(0-0-0-CCCH_SDCCH4-0)[0x21d8a10] detaches from conn
The failing test TC_paging_resp_unsol seems to be responsible for the that.
Updated by neels almost 5 years ago
It seems I have forgotten about this issue; all I know is that currently this test passes for osmo-bsc master
Updated by neels almost 5 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100