Project

General

Profile

Actions

Bug #5982

open

conn->fi is NULL in gscon_bssmap_clear()

Added by keith 11 months ago. Updated 11 months ago.

Status:
Feedback
Priority:
Low
Assignee:
Category:
-
Target version:
-
Start date:
03/29/2023
Due date:
% Done:

0%

Spec Reference:

Description

not master, but I don't think anything relevant has changed, there is one SEGV fix (7a0bef1ae4784203bf5f93b2dc2c4138dcad9397) but my quick static analysis suggests it's not related.

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0  0x0000558c248184b4 in gscon_bssmap_clear (conn=conn@entry=0x558c262a0410, cause=cause@entry=GSM0808_CAUSE_EQUIPMENT_FAILURE) at bsc_subscr_conn_fsm.c:151
#1  0x0000558c24819932 in gscon_forget_lchan (conn=conn@entry=0x558c262a0410, lchan=lchan@entry=0x7faef1906718) at bsc_subscr_conn_fsm.c:943
#2  0x0000558c2487f3cf in lchan_fsm_wait_rf_release_ack_onenter (fi=<optimized out>, prev_state=<optimized out>) at lchan_fsm.c:1429
#3  0x00007faef078b41b in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#4  0x00007faef078bb1d in _osmo_fsm_inst_state_chg () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#5  0x0000558c24871c56 in lchan_fsm_timer_cb (fi=0x558c26255430) at lchan_fsm.c:1810
#6  0x00007faef078d0f1 in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#7  0x00007faef07861f6 in osmo_timers_update () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#8  0x00007faef0786d25 in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#9  0x00007faef0786db6 in osmo_select_main_ctx () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#10 0x0000558c247dc486 in main (argc=<optimized out>, argv=<optimized out>) at osmo_bsc_main.c:1031

(gdb) p conn->fi
$4 = (struct osmo_fsm_inst *) 0x0

I don't have log at level DEBUG but this looks to be the trigger condition:

DRSL ERROR handover_fsm.c:1557 handover(intraBSC_msc0-conn1_subscr-IMSI-[redacted]-TMSI-0x213e241e)[0x558c262a2e70]{WAIT_LCHAN_ACTIVE}: (4-0-4-TCH_F-0-SPEECH_V1) --HO-> (0-0-4-TCH/F_TCH/H_SDCCH8_PDCH:PDCH-0) (subscr subscr-IMSI-[redacted]-TMSI-0x213e241e) HO-intraBSC: Handover failed in state WAIT_SCCP_RLSD, Connection released: Connection releasing in the middle of handover

I have four SEGV in the log over the last few days and all are preceded by this RSL ERROR, however in at least one of the core dumps, conn->fi is not NULL but the crash is the same, weird?

coredumps are available. ping me on IRC for access.

Actions #1

Updated by neels 11 months ago

  • Status changed from New to Feedback

Whenever we encounter segfaults like this, the backtrace is very good indication for the symptom, and it is easy to plaster over it with some NULL checks, but it is very hard to understand the actual root cause. Whenever possible, please also provide at least a "screenful" of the log output and/or pcap trace leading up to the segfault.

AFAICT conn->fi should never be NULL.
Or if it is null, then definitely no lchan should point at the conn anymore.
So I'm eager to find out which code path leads to this prohibited state.

I can see this context so far:
  • we are receiving a timeout on lchan, which changes lchan to state WAIT_RF_RELEASE_ACK
  • an intra-BSS handover was ongoing (BTS 0 to BTS 4), the handover is waiting for the target lchan to become available (WAIT_LCHAN_ACTIVE)
  • the subscriber conn is already in SCCP release (WAIT_SCCP_RLSD), i.e. we've either received BSSMAP Clear Command from MSC, or we were unable to send a Clear Request out to the MSC. Reason unknown.

Can you get me some more context leading up to a segfault like this, please?

Actions #2

Updated by neels 11 months ago

Ok, so i logged in on the system where the segfault happened

(gdb) bt
#0  0x0000558c248184b4 in gscon_bssmap_clear (conn=conn@entry=0x558c262a0410, cause=cause@entry=GSM0808_CAUSE_EQUIPMENT_FAILURE) at bsc_subscr_conn_fsm.c:151
#1  0x0000558c24819932 in gscon_forget_lchan (conn=conn@entry=0x558c262a0410, lchan=lchan@entry=0x7faef1906718) at bsc_subscr_conn_fsm.c:943
#2  0x0000558c2487f3cf in lchan_fsm_wait_rf_release_ack_onenter (fi=<optimized out>, prev_state=<optimized out>) at lchan_fsm.c:1429
#3  0x00007faef078b41b in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#4  0x00007faef078bb1d in _osmo_fsm_inst_state_chg () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#5  0x0000558c24871c56 in lchan_fsm_timer_cb (fi=0x558c26255430) at lchan_fsm.c:1810
#6  0x00007faef078d0f1 in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#7  0x00007faef07861f6 in osmo_timers_update () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#8  0x00007faef0786d25 in ?? () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#9  0x00007faef0786db6 in osmo_select_main_ctx () from /usr/lib/x86_64-linux-gnu/libosmocore.so.19
#10 0x0000558c247dc486 in main (argc=<optimized out>, argv=<optimized out>) at osmo_bsc_main.c:1031
(gdb) frame 2
#2  0x0000558c2487f3cf in lchan_fsm_wait_rf_release_ack_onenter (fi=<optimized out>, prev_state=<optimized out>) at lchan_fsm.c:1429
1429    lchan_fsm.c: No such file or directory.

(gdb) p lchan->name
$2 = 0x558c26201840 "(bts=0,trx=0,ts=4,ss=0)" 

so it is about lchan 0-0-4.
The lchan FSM enters the state rf_release_ack, i.e. it will first send an RF Release request, and wait for its ack.
Problem is, it has a conn pointer, but conn->fi NULL.
The interesting question: how can conn->fi NULL.

Leading up to the crash, I see this log:

'grep bts0-trx0-ts4' shows that the lchan has only been enabled ten minutes prior:

Mar 29 08:40:25 nuyoo osmo-bsc[27155]: DNM INFO nm_channel_fsm.c:190 NM_CHAN_OP(bts0-trx0-ts4)[0x558c262018c0]{DISABLED_OFFLINE}: State change to ENABLED (no timeout)

grep '0-0-4' shows some things, but since the log level is NOTICE, the picture is incomplete. There seems to be a lot of activity, but the interesting detail about what state the lchan and conn are in is left out of the log:

Mar 29 08:48:33 nuyoo osmo-bsc[27155]: DRSL NOTICE handover_decision_2.c:2059 handover(intraBSC_msc0-conn1_subscr-IMSI-334020535019207-TMSI-0x213e241e)[0x558c262e2880]{WAIT_RR_HO_DETECT}: (4-0-4-TCH_F-0-SPEECH_V1) --HO-> (0-0-4-TCH/F_TCH/H_SDCCH8_PDCH:TCH/F-0) (subscr subscr-IMSI-334020535019207-TMSI-0x213e241e) HO-intraBSC: Failed, starting penalty timer (5 s)
Mar 29 08:48:33 nuyoo osmo-bsc[27155]: DCHAN NOTICE abis_rsl.c:2949 lchan(0-0-4-OSMO_DYN-0)[0x558c26255430]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Mar 29 08:48:51 nuyoo osmo-bsc[27155]: DCHAN NOTICE abis_rsl.c:2949 lchan(0-0-4-OSMO_DYN-0)[0x558c26255430]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Mar 29 08:49:53 nuyoo osmo-bsc[27155]: DCHAN NOTICE abis_rsl.c:2949 lchan(0-0-4-OSMO_DYN-0)[0x558c26255430]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Rx IPACC DLCX IND (cause=Normal event, unspecified [ 0f ])
Mar 29 08:50:00 nuyoo osmo-bsc[27155]: DRSL ERROR handover_fsm.c:1557 handover(intraBSC_msc0-conn1_subscr-IMSI-334020535019207-TMSI-0x213e241e)[0x558c262a2e70]{WAIT_LCHAN_ACTIVE}: (4-0-4-TCH_F-0-SPEECH_V1) --HO-> (0-0-4-TCH/F_TCH/H_SDCCH8_PDCH:PDCH-0) (subscr subscr-IMSI-334020535019207-TMSI-0x213e241e) HO-intraBSC: Handover failed in state WAIT_SCCP_RLSD, Connection released: Connection releasing in the middle of handover
Mar 29 08:50:00 nuyoo osmo-bsc[27155]: DRSL NOTICE handover_decision_2.c:2059 handover(intraBSC_msc0-conn1_subscr-IMSI-334020535019207-TMSI-0x213e241e)[0x558c262a2e70]{WAIT_LCHAN_ACTIVE}: (4-0-4-TCH_F-0-SPEECH_V1) --HO-> (0-0-4-TCH/F_TCH/H_SDCCH8_PDCH:PDCH-0) (subscr subscr-IMSI-334020535019207-TMSI-0x213e241e) HO-intraBSC: Failed, starting penalty timer (5 s)

So that's not much help.

looking at the lchan:

(gdb) p lchan->conn
$3 = (struct gsm_subscriber_connection *) 0x0

That's because we're after this step, which already set lchan->conn = NULL

        if (lchan->conn) {
                struct gsm_subscriber_connection *conn = lchan->conn;
       -->      lchan_forget_conn(lchan);
                gscon_forget_lchan(conn, lchan);
        }

So there was an lchan->conn != NULL.
frame#0 conn still points at it.

(gdb) frame 0
#0  0x0000558c248184b4 in gscon_bssmap_clear (conn=conn@entry=0x558c262a0410, cause=cause@entry=GSM0808_CAUSE_EQUIPMENT_FAILURE) at bsc_subscr_conn_fsm.c:151
151    bsc_subscr_conn_fsm.c: No such file or directory.
(gdb) p conn
$5 = (struct gsm_subscriber_connection *) 0x558c262a0410
(gdb) p *conn
$6 = {entry = {next = 0x0, prev = 0x0}, fi = 0x0, bsub = 0x0, network = 0x0, lchan = 0x0, [...]

This must be a stale conn pointer. The conn is neither in the list of conns (entry {0,0}), nor does it have a subscr conn FSM inst, nor a subscriber, not even a global gsm_network pointer.

How is that even possible. There never is a time when an existing conn has a fi NULL, and all conns are always listed in the global llist of conns, right up until talloc_free(). There must be a case where a conn gets discarded, but fails to remove itself from an lchan->conn pointer.

To find out how that happens, let's put some code in the gscon cleanup function:
When gscon cleanup is done, it should have made sure that no lchan still points at the conn.
So let's iterate all lchans everywhere, and if an lchan->conn still points at the conn, flag that red in the log.

Actions #3

Updated by keith 11 months ago

I just checked the entire network for recent coredumps of osmo-bsc. There are none except one other machine... unfortunately, it's the same SEGV (i was hoping we might blame hardware, but actually this is good that it is confirmed)

Unfortunately, as the config is standardised, I don't have any more than DCHAN NOTICE here either.

What's rather bizarre is that in this coredump, the conn has a fi in frame 0:

(gdb) frame 0 
#0  0x00005562b23dce03 in gscon_bssmap_clear (conn=conn@entry=0x5562b2b56fc0, cause=cause@entry=GSM0808_CAUSE_EQUIPMENT_FAILURE) at bsc_subscr_conn_fsm.c:160
160    in bsc_subscr_conn_fsm.c
(gdb) p conn
$10 = (struct gsm_subscriber_connection *) 0x5562b2b56fc0
(gdb) p conn->fi
$11 = (struct osmo_fsm_inst *) 0x5562b2af7450
(gdb) p *conn->fi
$13 = {list = {next = 0x5562b2ace130, prev = 0x5562b2ace130}, fsm = 0x5562b2ace100, id = 0x5562b2b22aa0 "p\202\026_\024\177", name = 0x0, priv = 0x0, log_level = 0, state = 0, 
  T = -1297121968, timer = {node = {rb_parent_color = 0, rb_right = 0x5562b2b19670, rb_left = 0x0}, list = {next = 0x7f145f1654c0, prev = 0x5562b2af7a90}, timeout = {tv_sec = 256, 
      tv_usec = 93882392999824}, active = 0, cb = 0x5562b2af7990, data = 0x0}, proc = {parent = 0x2, parent_term_event = 0, children = {next = 0x0, prev = 0x5562b2af73d0}, child = {
      next = 0x0, prev = 0x0}, terminating = false}}

wtf?

Actions #4

Updated by keith 11 months ago

This conn also looks rather suspect...

(gdb) p *conn
$4 = {entry = {next = 0x0, prev = 0x0}, fi = 0x5562b2af7450, bsub = 0x0, network = 0x0, lchan = 0x0, assignment = {req = {assign_for = ASSIGN_FOR_NONE, aoip = false, 
      msc_assigned_cic = 0, msc_rtp_addr = '\000' <repeats 32 times>, "\003\000\000\000\000\000\000\000\001\000\000\000\000", msc_rtp_port = 0, use_osmux = false, 
      osmux_cid = 10 '\n', n_ch_mode_rate = 0, ch_mode_rate_list = {{chan_mode = 2998235208, chan_rate = (CH_RATE_FULL | unknown: 21856), s15_s0 = 29300, csd_mode = 21858}, {
          chan_mode = 2998235720, chan_rate = (CH_RATE_FULL | unknown: 21856), s15_s0 = 0, csd_mode = LCHAN_CSD_M_NT}, {chan_mode = GSM48_CMODE_SIGN, chan_rate = CH_RATE_SDCCH, 
          s15_s0 = 0, csd_mode = LCHAN_CSD_M_NT}}, target_lchan = 0x0, tsc_set = {present = false, val = 0}, tsc = {present = false, val = 0}}, requires_voice_stream = false, etc etc etc....

Also, I guess I'm not sure, but given the line number gdb tells us the segv happened at, is it in this case

conn->clear_cause = cause;

that is the violation?

I don't know how a stale pointer happens or what that really means internally, but I've seen this kind of thing before, where I could not explain a SEGV as the pointers look like valid addresses, but accessing them crashes the program.

Actions #5

Updated by keith 11 months ago

neels wrote in #note-2:

'grep bts0-trx0-ts4' shows that the lchan has only been enabled ten minutes prior:

Yes, look closely at the coredump list, you'll notice the exact same bug was hit 10 minutes previously, so osmo-bsc has only been running 10 mins.

Actions #6

Updated by neels 11 months ago

I've added a patch that checks cleanup of lchan->conn pointers here [1],
and run the entire BSC_Tests.ttcn test suite without hitting any stale lchan->conn pointers.
So next up I would suggest we somehow bring this patch to that site and run osmo-bsc with that for some time.

[1]:
osmo-bsc.git branch neels/os5982
https://cgit.osmocom.org/osmo-bsc/log/?h=neels/os5982
(the patch should cleanly cherry-pick onto rhizo/production_latest)

about rolling out that patch i'm not sure.
I could manually build osmo-bsc.git -- but I don't want to mess up the system.
I could push the patch to a specific branch,
and then we somehow tell puppet to only deploy that branch on the one site;
or we're fine with running it everywhere;
... keith, so many options, which way to go?

Via VTY and 'write file', I changed the logging on nuyoo to:
logging level set-all info
logging level hodec notice
logging level pag notice

I don't know how a stale pointer happens or what that really means internally

Say I have a variable foo pointing at an allocated object,
so then if I free the allocated object, I have to set foo = NULL, because the object is no longer there.
It's like someone lived in a house, but when the person moved out, all friends need to scrap the old address they had.
A stale pointer happens when foo still points at the memory that was freed, where the object once was.
It looks like there should be an object there, but by now who knows what is happening at that memory location.
Like the friend still visits the old house, but who knows who lives there now, or if there even is a house there anymore.
Sometimes they end up a SEGV when the address space is outside all valid pages,
and sometimes there's now a different valid object at that memory location, and we read its unrelated bytes or even overwrite that by accident.
The address sanitizer build (--enable-sanitize) makes it so that all access of freed mem aborts the program, not only when the address is outside of valid memory; but it causes the program to run somewhat slower.

Actions #7

Updated by keith 11 months ago

neels wrote in #note-6:

and then we somehow tell puppet to only deploy that branch on the one site;

just FYI, that's not a problem, although it's not actually implemented. (it would take 10 mins to implement) but what is implemented is telling an individual site it should run osmo-bsc and/or osmo-msc from /usr/local/bin so maybe the easiest thing for me to do is build an osmo-bsc with your patch on the build machine where I made the current package, that builds against the very same lib packages also built there, and copy it manually to /usr/local/bin

or we're fine with running it everywhere;

Also a possibility.

... keith, so many options, which way to go?

Via VTY and 'write file', I changed the logging on nuyoo to:
logging level set-all info
logging level hodec notice
logging level pag notice

good. i had already changed logging of just chan to info,but this is also fine. again FYI, "write" is not really necessary and come to think of it, is bad, as puppet will notice the change and revert it within the next 30 mins, and puppet will also write a semaphore file to inform that "something" changed in osmo-bsc and osmo-bsc should therefore be restarted when opportune, which will be the first moment after 2.30AM local time when there are no calls in progress. I will delete the semaphore manually this time so puppet does not restart osmo-bsc.

Although, if I get around to deploying the patch tonight, then we want it to restart.. If I get a chance to do that without dropping calls on people I'll do it.

Say I have a variable foo pointing at an allocated object,..

Ah yes.. thanks for that very concise and easy to understand explanation. I guess it's totally random whether the address space is outside valid pages or not, which explains what I saw before about a bug that happened (like this one, once in a blue moon), UNLESS built with AS, then it would crash almost immediately.. which makes me wonder, should I build osmo-bsc now with the AS turned on?

Actions #8

Updated by neels 11 months ago

On Fri, Mar 31, 2023 at 02:40:06AM +0000, keith wrote:

which makes me wonder, should I build osmo-bsc now with the AS turned on?

I would say yes, because I am a developer and want to find all the bugs, and
I'd argue if an address error occurrs, and it goes undetected, the program may
continue in mysterious and random ways. So if we suspect there are still
errors, it is better for long term system stability to hit them and fix, than
to let them linger. The pretext is that we are willing to hit crashes and see
service disruption now, and then fix them, and be happy later on.

Counter arguments are, I am told some other bugs are covered up by the address
sanitizer.

And also address sanitizer binaries are a bit slower, so if milking for peak
CPU performance, then better no sanitizer build. But I guess at rhz, the
bottlenecks are elsewhere than CPU load?

Actions #9

Updated by keith 11 months ago

so, the address-sanitized binary with patch has been running now since Fri 31 Mar.

We do have some logs from the patch:

Apr 04 20:00:02 nuyoo osmo-bsc[937]: DCHAN ERROR bsc_subscr_conn_fsm.c:1066 lchan(1-0-5-OSMO_DYN-0)[0x612000039fa0]{WAIT_BEFORE_RF_RELEASE}: (type=TCH_F) This lchan still points at discarded conn SUBSCR_CONN(msc0-conn127870_subscr-IMSI-33402XXXXXXXXXX-TMSI-0xa0ef5757)[0x612000187ea0]
Apr 04 20:00:02 nuyoo osmo-bsc[937]: DMSC ERROR bsc_subscr_conn_fsm.c:1069 SUBSCR_CONN(msc0-conn127870_subscr-IMSI-33402XXXXXXXXXX-TMSI-0xa0ef5757)[0x612000187ea0]{WAIT_SCCP_RLSD}: This conn is discarding, but lchan is still pointing at this conn: (bts=1,trx=0,ts=5,ss=0)

Maybe neels would like to log in and grab some lead-up log

Actions #10

Updated by neels 11 months ago

  • Assignee changed from keith to neels
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)