Project

General

Profile

Bug #4817

gtp_delete_context_req2() called with NULL pctx->ggsn

Added by keith about 1 month ago. Updated about 1 month ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
10/18/2020
Due date:
% Done:

70%

Spec Reference:

Description

SGSN in production still seems to be crashing often. I enabled core dumps on one system and caught this Backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005627bb8c9c7d in sgsn_delete_pdp_ctx (pctx=0x5627bdafacc0) at sgsn_libgtp.c:321
321    sgsn_libgtp.c: No such file or directory.
(gdb) bt
#0  0x00005627bb8c9c7d in sgsn_delete_pdp_ctx (pctx=0x5627bdafacc0) at sgsn_libgtp.c:321
#1  0x00005627bb8be8ff in gsm48_rx_gsm_deact_pdp_req (msg=0x0, mm=0x5627bdb5bb30) at gprs_sm.c:642
#2  gsm0408_rcv_gsm (mmctx=mmctx@entry=0x5627bdb5bb30, msg=msg@entry=0x5627bda33550, llme=llme@entry=0x5627bdb30c20) at gprs_sm.c:730
#3  0x00005627bb8b414e in gsm0408_gprs_rcvmsg_gb (msg=msg@entry=0x5627bda33550, llme=0x5627bdb30c20, drop_cipherable=drop_cipherable@entry=false) at gprs_gb.c:69
#4  0x00005627bb8cbe95 in gprs_llc_rcvmsg (msg=0x5627bda33550, tv=<optimized out>) at gprs_llc.c:1011
#5  0x00007f3b2ae8dc51 in bssgp_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11
#6  0x00007f3b2ae87744 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#7  0x00007f3b2ae8936a in gprs_ns_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11
#8  0x00007f3b2ae894d9 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#9  0x00007f3b2ae2f514 in ?? () from /lib/x86_64-linux-gnu/libosmocore.so.16
#10 0x00007f3b2ae2fc26 in osmo_select_main () from /lib/x86_64-linux-gnu/libosmocore.so.16
#11 0x00005627bb8b3d6a in main (argc=<optimized out>, argv=0x7fffba9dbbd8) at sgsn_main.c:556
(gdb) p *pctx
$1 = {list = {next = 0x5627bdb5bcc8, prev = 0x5627bdb5bcc8}, g_list = {next = 0x5627bda82940, prev = 0x5627bdb736f0}, mm = 0x5627bdb5bb30, destroy_ggsn = 0, ggsn = 0x0,
  ggsn_list = {next = 0x100100, prev = 0x200200}, ctrg = 0x5627bd99ca50, lib = 0x0, state = PDP_STATE_CR_REQ, type = PDP_TYPE_NONE, address = 0, apn_subscribed = 0x0, 
  nsapi = 5, sapi = 11, ti = 0 '\000', vplmn_allowed = 0, qos_profile_subscr = 0, radio_prio = 0 '\000', timer = {node = {rb_parent_color = 94728684322873, 
      rb_right = 0x5627bd942938, rb_left = 0x5627bd95b0e0}, list = {next = 0x5627bdafad60, prev = 0x5627bdafad60}, timeout = {tv_sec = 1603034867, tv_usec = 862976}, 
    active = 1, cb = 0x5627bb8bd0d0 <pdpctx_timer_cb>, data = 0x5627bdafacc0}, T = 3395, num_T_exp = 0, cdr_timer = {node = {rb_parent_color = 94728686758337, 
      rb_right = 0x0, rb_left = 0x0}, list = {next = 0x5627bdafadb8, prev = 0x5627bdafadb8}, timeout = {tv_sec = 1603035414, tv_usec = 613457}, active = 1, 
    cb = 0x5627bb8d00d0 <cdr_pdp_timeout>, data = 0x5627bdafacc0}, cdr_start = {tv_sec = 55141, tv_nsec = 130687750}, cdr_bytes_in = 12878, cdr_bytes_out = 17074, 
  cdr_charging_id = 99}

^^ ggsn = 0x0

Sorry, no leading log available right now.

History

#1 Updated by keith about 1 month ago

Huh?

#2  gsm0408_rcv_gsm (mmctx=mmctx@entry=0x5627bdb5bb30, msg=msg@entry=0x5627bda33550, [....]
int gsm0408_rcv_gsm(struct sgsn_mm_ctx *mmctx, struct msgb *msg,
               struct gprs_llc_llme *llme)
{
    struct gsm48_hdr *gh = (struct gsm48_hdr *) msgb_gmmh(msg);
    int rc;

    /* MMCTX can be NULL when called */

    if (!mmctx) {
        LOGGBIUP(llme, msg, LOGL_NOTICE, "Cannot handle SM for unknown MM CTX\n");
        /* 6.1.3.6 */
        if (gh->msg_type == GSM48_MT_GSM_STATUS)
            return 0;

        return gsm0408_gprs_force_reattach_oldmsg(msg, llme);
    }

    switch (gh->msg_type) {
    case GSM48_MT_GSM_ACT_PDP_REQ:
        rc = gsm48_rx_gsm_act_pdp_req(mmctx, msg);
        break;
    case GSM48_MT_GSM_DEACT_PDP_REQ:
        rc = gsm48_rx_gsm_deact_pdp_req(mmctx, msg);
#1  0x00005627bb8be8ff in gsm48_rx_gsm_deact_pdp_req (msg=0x0, mm=0x5627bdb5bb30) at gprs_sm.c:642

msg=0x0 ???

How the hell does that happen?

#2 Updated by laforge about 1 month ago

So if I'm following correctly, msg is still non-null ine stack frame 2, but then
is NULL in stack frame 1? If thre is no code path in the source setting it to NULL
meanwhile, it might be stack overflow? Can you reproduce with valgrind or --enable-sanitize?
enabled?

#3 Updated by keith about 1 month ago

I have recompiled with -O0 and left it running overnight - This morning, I have a new backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00005593d4c4a134 in sgsn_delete_pdp_ctx (pctx=0x5593d695e820) at sgsn_libgtp.c:321
321     sgsn_libgtp.c: No such file or directory.
(gdb) bt
#0  0x00005593d4c4a134 in sgsn_delete_pdp_ctx (pctx=0x5593d695e820) at sgsn_libgtp.c:321                                                                         
#1  0x00005593d4c3a6a7 in gsm48_rx_gsm_deact_pdp_req (mm=0x5593d69b9260, msg=0x5593d6930240) at gprs_sm.c:642                                                    
#2  0x00005593d4c3ac9e in gsm0408_rcv_gsm (mmctx=0x5593d69b9260, msg=0x5593d6930240, llme=0x5593d6c2d350) at gprs_sm.c:730                                       
#3  0x00005593d4c2c961 in gsm0408_gprs_rcvmsg_gb (msg=0x5593d6930240, llme=0x5593d6c2d350, drop_cipherable=false) at gprs_gb.c:69                                
#4  0x00005593d4c4df43 in gprs_llc_rcvmsg (msg=0x5593d6930240, tv=0x7ffc62e44780) at gprs_llc.c:1011                                                             
#5  0x00005593d4c44c24 in bssgp_prim_cb (oph=0x7ffc62e44740, ctx=0x0) at sgsn_main.c:128                                                                         
#6  0x00007f9b41dacc51 in bssgp_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11                                                                             
#7  0x00005593d4c44b46 in sgsn_ns_cb (event=GPRS_NS_EVT_UNIT_DATA, nsvc=0x5593d677cb70, msg=0x5593d6930240, bvci=2) at sgsn_main.c:106                           
#8  0x00007f9b41da6744 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#9  0x00007f9b41da836a in gprs_ns_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11                                                                           
#10 0x00007f9b41da84d9 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#11 0x00007f9b41d4e514 in ?? () from /lib/x86_64-linux-gnu/libosmocore.so.16
#12 0x00007f9b41d4ec26 in osmo_select_main () from /lib/x86_64-linux-gnu/libosmocore.so.16                                                                       
#13 0x00005593d4c457e9 in main (argc=3, argv=0x7ffc62e46c48) at sgsn_main.c:556
(gdb)
(gdb) frame
#1  0x00005593d4c3a6a7 in gsm48_rx_gsm_deact_pdp_req (mm=0x5593d69b9260, msg=0x5593d6930240) at gprs_sm.c:642
642     in gprs_sm.c
(gdb) info locals
gh = 0x5593d69302fa
transaction_id = 0 '\000'
pdp = 0x5593d695e820
(gdb) p *gh
$40 = {proto_discr = 10 '\n', msg_type = 70 'F', data = 0x5593d69302fc "$\221'\bI"} 
(gdb) p *pdp
$43 = {list = {next = 0x5593d69b93f8, prev = 0x5593d69b93f8}, g_list = {next = 0x5593d68aaf20, prev = 0x5593d6c7fe10}, mm = 0x5593d69b9260, destroy_ggsn = 0, 
  ggsn = 0x0, ggsn_list = {next = 0x100100, prev = 0x200200}, ctrg = 0x5593d6819e40, lib = 0x0, state = PDP_STATE_CR_REQ, type = PDP_TYPE_NONE, address = 0, 
  apn_subscribed = 0x0, nsapi = 5, sapi = 3, ti = 0 '\000', vplmn_allowed = 0, qos_profile_subscr = 0, radio_prio = 0 '\000', timer = {node = {
      rb_parent_color = 94093448641872, rb_right = 0x5593d6a08c70, rb_left = 0x5593d699e870}, list = {next = 0x5593d695e8c0, prev = 0x5593d695e8c0}, timeout = {
      tv_sec = 1603067602, tv_usec = 18742}, active = 1, cb = 0x5593d4c3a909 <pdpctx_timer_cb>, data = 0x5593d695e820}, T = 3395, num_T_exp = 0, cdr_timer = {
    node = {rb_parent_color = 94093449440992, rb_right = 0x0, rb_left = 0x0}, list = {next = 0x5593d695e918, prev = 0x5593d695e918}, timeout = {
      tv_sec = 1603067772, tv_usec = 134472}, active = 1, cb = 0x5593d4c53687 <cdr_pdp_timeout>, data = 0x5593d695e820}, cdr_start = {tv_sec = 86298, 
    tv_nsec = 651482255}, cdr_bytes_in = 216087, cdr_bytes_out = 194657, cdr_charging_id = 114}

#4 Updated by keith about 1 month ago

(gdb) frame
#2  0x00005593d4c3ac9e in gsm0408_rcv_gsm (mmctx=0x5593d69b9260, msg=0x5593d6930240, llme=0x5593d6c2d350) at gprs_sm.c:730
730     in gprs_sm.c
(gdb) p mmctx->imsi
$48 = "334020344006209" 

Possibly related log: (just before SIGSEGV)

20201018193313833 DGPRS INFO PDP(334020344006209/0) Delete PDP Context (sgsn_libgtp.c:318)                                                                        
20201018193314018 DLGTP ERROR Packet from 10.23.0.20:2152, length: 24 content: 32 1a 00 10 00 00 00 00 08 50 00 00 10 00 00 00 72 85 00 04 0a 17 00 14 : Received Error Indication (gtp.c:2802)                                                                                                                                     
20201018193314018 DGPRS INFO PDP(334020344006209:5): Context 0x7f9b40bd2a68 was deleted (sgsn_libgtp.c:588)                                                       
20201018193314018 DLGTP ERROR Packet from 10.23.0.20:2123, length: 14 content: 32 15 00 06 00 00 00 18 f9 88 00 00 01 80 : Confirmation packet not found in queue (gtp.c:688)
20201018193314019 DLGTP ERROR Packet from 10.23.0.20:2152, length: 24 content: 32 1a 00 10 00 00 00 00 08 51 00 00 10 00 00 00 72 85 00 04 0a 17 00 14 : Unknown PDP context (gtp.c:2793)
20201018193314022 DLGTP ERROR Packet from 10.23.0.20:2152, length: 24 content: 32 1a 00 10 00 00 00 00 08 52 00 00 10 00 00 00 72 85 00 04 0a 17 00 14 : Unknown PDP context (gtp.c:2793)
20201018193314022 DLGTP ERROR Packet from 10.23.0.20:2152, length: 24 content: 32 1a 00 10 00 00 00 00 08 54 00 00 10 00 00 00 72 85 00 04 0a 17 00 14 : Unknown PDP context (gtp.c:2793)
20201018193314023 DLGTP ERROR Packet from 10.23.0.20:2152, length: 24 content: 32 1a 00 10 00 00 00 00 08 55 00 00 10 00 00 00 72 85 00 04 0a 17 00 14 : Unknown PDP context (gtp.c:2793)

#5 Updated by pespin about 1 month ago

  • Status changed from New to Feedback
  • Assignee set to pespin
  • % Done changed from 0 to 90

Potential fix for the issue:
https://gerrit.osmocom.org/c/osmo-sgsn/+/20789 Log error if pdp ctx is freed while holding an active timer
https://gerrit.osmocom.org/c/osmo-sgsn/+/20790 Fix crash rx DeactPdpReq while waiting for DeactPdpAck after gtp side is freed

#6 Updated by keith about 1 month ago

Running now for 21 hours with no segfault.

#7 Updated by keith about 1 month ago

Waiting on data to confirm if further observed segfaults are related to this or not..

#8 Updated by keith about 1 month ago

A different code path this time, but the crash is in the same place:

static void process_ms_ctx_status(struct sgsn_mm_ctx *mmctx,
                  const uint8_t *pdp_status)
{
    struct sgsn_pdp_ctx *pdp, *pdp2;
    /* 24.008 4.7.5.1.3: If the PDP context status information element is
     * included in ROUTING AREA UPDATE REQUEST message, then the network
     * shall deactivate all those PDP contexts locally (without peer to
     * peer signalling between the MS and the network), which are not in SM
     * state PDP-INACTIVE on network side but are indicated by the MS as
     * being in state PDP-INACTIVE. */

    llist_for_each_entry_safe(pdp, pdp2, &mmctx->pdp_list, list) {
        if (pdp->nsapi < 8) {
            if (!(pdp_status[0] & (1 << pdp->nsapi))) {
                LOGMMCTXP(LOGL_NOTICE, mmctx, "Dropping PDP context for NSAPI=%u " 
                    "due to PDP CTX STATUS IE= 0x%02x%02x\n",
                    pdp->nsapi, pdp_status[1], pdp_status[0]);
                sgsn_delete_pdp_ctx(pdp); <------ 

// LINE 1532      ^^^^^^^^^^^^^^^^^^^
[ SNIP ]
}

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055674b36b202 in sgsn_delete_pdp_ctx (pctx=0x55674c6c6230) at sgsn_libgtp.c:321
321    sgsn_libgtp.c: No such file or directory.
(gdb) bt
#0  0x000055674b36b202 in sgsn_delete_pdp_ctx (pctx=0x55674c6c6230) at sgsn_libgtp.c:321
#1  0x000055674b35367f in process_ms_ctx_status (mmctx=0x55674c9161a0, pdp_status=0x55674c772398 "") at gprs_gmm.c:1532
#2  0x000055674b3542a5 in gsm48_rx_gmm_ra_upd_req (mmctx=0x55674c9161a0, msg=0x55674c7722b0, llme=0x55674c6e63c0) at gprs_gmm.c:1752
#3  0x000055674b355135 in gsm0408_rcv_gmm (mmctx=0x55674c9161a0, msg=0x55674c7722b0, llme=0x55674c6e63c0, drop_cipherable=false) at gprs_gmm.c:2041
#4  0x000055674b34d942 in gsm0408_gprs_rcvmsg_gb (msg=0x55674c7722b0, llme=0x55674c6e63c0, drop_cipherable=false) at gprs_gb.c:66
#5  0x000055674b36f011 in gprs_llc_rcvmsg (msg=0x55674c7722b0, tv=0x7ffe3b0ad8a0) at gprs_llc.c:1011
#6  0x000055674b365cf2 in bssgp_prim_cb (oph=0x7ffe3b0ad860, ctx=0x0) at sgsn_main.c:128
#7  0x00007f03717dbc51 in bssgp_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11
#8  0x000055674b365c14 in sgsn_ns_cb (event=GPRS_NS_EVT_UNIT_DATA, nsvc=0x55674c546910, msg=0x55674c7722b0, bvci=2) at sgsn_main.c:106
#9  0x00007f03717d5744 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#10 0x00007f03717d736a in gprs_ns_rcvmsg () from /lib/x86_64-linux-gnu/libosmogb.so.11
#11 0x00007f03717d74d9 in ?? () from /lib/x86_64-linux-gnu/libosmogb.so.11
#12 0x00007f037177d514 in ?? () from /lib/x86_64-linux-gnu/libosmocore.so.16
#13 0x00007f037177dc26 in osmo_select_main () from /lib/x86_64-linux-gnu/libosmocore.so.16
#14 0x000055674b3668b7 in main (argc=3, argv=0x7ffe3b0afd68) at sgsn_main.c:556

(gdb) p *pctx
$1 = {list = {next = 0x55674c916338, prev = 0x55674c916338}, g_list = {next = 0x55674c946480, prev = 0x55674c696200}, mm = 0x55674c9161a0, destroy_ggsn = 0, ggsn = 0x0, ggsn_list = {
    next = 0x100100, prev = 0x200200}, ctrg = 0x55674c7f4670, lib = 0x0, state = PDP_STATE_CR_REQ, type = PDP_TYPE_NONE, address = 0, apn_subscribed = 0x0, nsapi = 5, sapi = 3, 
  ti = 0 '\000', vplmn_allowed = 0, qos_profile_subscr = 0, radio_prio = 0 '\000', timer = {node = {rb_parent_color = 93902154786041, rb_right = 0x0, rb_left = 0x0}, list = {
      next = 0x55674c6c62d0, prev = 0x55674c6c62d0}, timeout = {tv_sec = 1603387907, tv_usec = 785285}, active = 1, cb = 0x55674b35b9d7 <pdpctx_timer_cb>, data = 0x55674c6c6230}, 
  T = 3395, num_T_exp = 4, cdr_timer = {node = {rb_parent_color = 0, rb_right = 0x0, rb_left = 0x0}, list = {next = 0x0, prev = 0x0}, timeout = {tv_sec = 0, tv_usec = 0}, active = 0, 
    cb = 0x0, data = 0x0}, cdr_start = {tv_sec = 0, tv_nsec = 0}, cdr_bytes_in = 184, cdr_bytes_out = 0, cdr_charging_id = 0}

#9 Updated by keith about 1 month ago

  • Status changed from Feedback to In Progress
  • % Done changed from 90 to 70

:)

#10 Updated by pespin about 1 month ago

  • Status changed from In Progress to Feedback
  • Assignee changed from pespin to keith

I submitted these bunch of patches, please give it a try:
https://gerrit.osmocom.org/c/osmo-sgsn/+/20842 sgsn_delete_pdp_ctx: Add documentation and assert assumptions [NEW]
https://gerrit.osmocom.org/c/osmo-sgsn/+/20843 process_ms_ctx_status: refactor to avoid code duplication [NEW]
https://gerrit.osmocom.org/c/osmo-sgsn/+/20844 process_ms_ctx_status: Fix crash deleting PDP Ctx if GTP side was already ... [NEW]

#11 Updated by keith about 1 month ago

I have not deployed this yet everywhere, but on the sites that use nightly sgsn packages, I have not see a crash since this was merged.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)