Project

General

Profile

Actions

Bug #6359

closed

TbfTest: sporadic SEGV in

Added by pespin 2 months ago. Updated 16 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
02/13/2024
Due date:
% Done:

100%

Spec Reference:

Description

This error showed up today in raspbian11 test:
https://jenkins.osmocom.org/jenkins/job/master-osmo-pcu/FIRMWARE_VERSION=master,WITH_MANUALS=0,label=rpi4-raspbian11,with_dsp=none,with_vty=False/6448/console

../../../tests/testsuite.at:28: $OSMO_QEMU $abs_top_builddir/tests/tbf/TbfTest
--- experr    2024-02-13 07:37:47.768562611 +0000
+++ /build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/testsuite.dir/at-groups/4/stderr    2024-02-13 07:37:48.558541306 +0000
@@ -11580,18 +11580,16 @@
 TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Starting timer X2001 [assignment (PACCH)] with 2 sec. 0 microsec
 DL_TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}: Received Event ASSIGN_PCUIF_CNF
 TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Ignoring event ASSIGN_PCUIF_CNF from BTS (CCCH was not requested on current assignment)
-DL_ASS_TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){SEND_ASS}: Received Event CREATE_RLCMAC_MSG
-PDCH(bts=0,trx=0,ts=2) POLL scheduled at FN 0 + 13 = 13
-TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} start Packet Downlink Assignment (PACCH) for TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}
-+++++++++++++++++++++++++ TX : Packet Downlink Assignment +++++++++++++++++++++++++
-------------------------- TX : Packet Downlink Assignment -------------------------
-PDCH(bts=0,trx=0,ts=2) Reserving FN 13 for type POLL
-TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Scheduled DL Assignment polling on PACCH (FN=13, TS=2)
-DL_ASS_TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){SEND_ASS}: state_chg to WAIT_ACK
-PDCH(bts=0,trx=0,ts=2) FN=0 Scheduling control message at RTS for TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}
-=== end test_ms_merge_dl_tbf_different_trx ===
-MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:UL:DL) Destroying MS object
-MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:UL:DL) Detaching TBF: TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}
-MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:DL): - tbf: now used by 1 (tbf)
-MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:DL) Detaching TBF: TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}
-MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0): - tbf: now used by 0 (-)
+UL_TBF(UL:TFI-0-0-0:G){ASSIGN}: Timeout of X2002
+UL_TBF(UL:TFI-0-0-0:G){ASSIGN}: Received Event ASSIGN_READY_CCCH
+../../../src/tbf_ul_fsm.c:173:3: runtime error: member access within null pointer of type 'struct GprsMs'
+AddressSanitizer:DEADLYSIGNAL
+=================================================================
+==1984==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000c (pc 0x00687b8e bp 0x1d52b65d sp 0xbeeb12b0 T0)
+==1984==The signal is caused by a READ memory access.
+==1984==Hint: address points to the zero page.
+    #0 0x687b8e in st_assign (/build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/tbf/TbfTest+0x227b8e)
+
+AddressSanitizer can not provide additional info.
+SUMMARY: AddressSanitizer: SEGV (/build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/tbf/TbfTest+0x227b8e) in st_assign
+==1984==ABORTING
stdout:
../../../tests/testsuite.at:28: exit code was 1, expected 0
4. testsuite.at:25: 4. tbf (testsuite.at:25): FAILED (testsuite.at:28)

At first sight it seems that something took longer than usual to run (due to load?) and then some event triggered which caused a SEGV.

I marked the job above as "Keep this build forever". It can be unmarked once this issue is looked at.
I also upload here the build artifacts of the job just in case.


Files

archive.zip archive.zip 16.5 KB pespin, 02/13/2024 09:27 AM
Actions #1

Updated by pespin 2 months ago

Current osmo-pcu.git master at this time is: b04e1d7d0ce1ca95a0c7e9835478eb751bd47669

Actions #2

Updated by pespin 2 months ago

Seems its failing when doing:

#define tbf_ul_fsm_state_chg(fi, NEXT_STATE) \
    osmo_tdef_fsm_inst_state_chg(fi, NEXT_STATE, \
                     tbf_ul_fsm_timeouts, \
                     tbf_ms(((struct tbf_dl_fsm_ctx *)(fi->priv))->tbf)->bts->T_defs_bts, \
                     -1)

static void st_assign(struct osmo_fsm_inst *fi, uint32_t event, void *data)
{
    case TBF_EV_ASSIGN_READY_CCCH:
        /* change state to FLOW, so scheduler will start requesting USF */
        tbf_ul_fsm_state_chg(fi, TBF_ST_FLOW);

Fails when accessing "tbf_ms(((struct tbf_dl_fsm_ctx *)(fi->priv))", which returns a NULL pointer.

The cast there is wrong, I fixed it here: https://gerrit.osmocom.org/c/osmo-pcu/+/35957 tbf_ul_fsm: Fix wrong cast

However, that's not really the culprit (tbf is at the same offset in both).

Actions #3

Updated by pespin 2 months ago

This segfault happened again today in master-osmo-pcu jenkins job.

Actions #4

Updated by pespin about 1 month ago

With the following change, the crash can be reproduced easily:

diff --git a/tests/tbf/TbfTest.cpp b/tests/tbf/TbfTest.cpp
index 643551b9..b2650518 100644
--- a/tests/tbf/TbfTest.cpp
+++ b/tests/tbf/TbfTest.cpp
@@ -2524,6 +2524,7 @@ static void test_ms_merge_dl_tbf_different_trx(void)
        OSMO_ASSERT(osmo_tdef_set(the_pcu->T_defs, -2002, 0, OSMO_TDEF_MS) == 0);
        osmo_fsm_inst_dispatch(ms_dl_tbf(second_ms)->state_fi, TBF_EV_ASSIGN_PCUIF_CNF, NULL);
        osmo_select_main(0);
+       osmo_select_main(0);

        /* get the PACCH PktDlAss for the DL-TBF, allocated on the UL-TBF from the new MS obj: */
        request_dl_rlc_block(dl_tbf->bts, dl_tbf->control_ts, &fn);

Actions #5

Updated by pespin about 1 month ago

The UL TBF crashing is actually coming from another test, because in the test "test_ms_merge_dl_tbf_different_trx" the only UL TBF created happens through PACCH, and X2002 is only used during CCCH assignment.

By commenting all tests and running test_ms_merge_dl_tbf_different_trx() alone, it can be seen that the crash doesn't show up.

After several attempts, the one producing the crash later on when test_ms_merge_dl_tbf_different_trx blocks in osmo_main_select() is:

test_immediate_assign_rej();

Actions #6

Updated by pespin about 1 month ago

The problem seems triggered by the fact that test_immediate_assign_rej_multi_block() is calling "rc = bts_handle_rach(bts, 0x78, rach_fn, qta);", which create a UL TBF, but the UL TBF is never explicitly freed through procedures.

Instead, it was expecting "TALLOC_FREE(the_pcu);" at the end to free it.
However, when tearing down the whole tree of objects, ms_talloc_destructor() is unattaching the TBFs but not freeing them, which seems a bug.

Actions #7

Updated by pespin about 1 month ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 90

Should be fixed by:
https://gerrit.osmocom.org/c/osmo-pcu/+/36450 Make sure GprsMs free() also frees its tbfs

I run the whole ttcn3 PCU_Tests testsuite with the patch applied and it looked good.

Once merged, we can close the ticket.

Actions #8

Updated by pespin 16 days ago

  • Status changed from Feedback to Resolved
  • % Done changed from 90 to 100

Merged a few days ago, closing.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)