Project

General

Profile

Bug #2741

osmo-ggsn segfaults after receiving SIGINT

Added by pespin over 1 year ago. Updated 9 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
12/12/2017
Due date:
% Done:

0%

Spec Reference:

Description

while running osmo-ggsn in osmo-gsm-tester, at the time osmo-gsm-tester wants to finish the test it sends a SIGINT to stop osmo-ggsn.

After the SIGINT is sent, a segfautl can be sen in dmesg:

[1887221.478570] osmo-ggsn[20927]: segfault at b0 ip 00007f7a1a7d29a3 sp 00007ffe630d91f0 error 6 in libgtp.so.2.0.0[7f7a1a7ca000+11000]

The attached core file shows the following bt

Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  gtp_delete_context_req (gsn=0x0, pdp=0x7f7a1a9dd5c0 <pdpa>, cbp=0x4, teardown=1) at gtp.c:2309
#1  0x000000000040453f in pool_close_all_pdp (pool=0x7ffe630d9200) at ggsn.c:109
#2  0x0000000000405b66 in apn_stop (apn=0x14b2b50, force=false, force@entry=true) at ggsn.c:117
#3  0x0000000000406635 in ggsn_stop (ggsn=ggsn@entry=0x14b2690) at ggsn.c:868
#4  0x00000000004028d2 in ggsn_stop (ggsn=0x14b2690) at ggsn.c:863
#5  main (argc=3, argv=<optimized out>) at ggsn.c:1002

It seems the pointer passed as parameter to pool_close_all_pdp is corrupted somehow, because it's not the one in the structure:

(gdb) frame 2
#2  0x0000000000405b66 in apn_stop (apn=0x14b2b50, force=false, force@entry=true) at ggsn.c:117
(gdb) print apn
$3 = (struct apn_ctx *) 0x14b2b50
(gdb) print *apn
$4 = {list = {next = 0x14b2fa0, prev = 0x14b26a0}, ggsn = 0x14b2690, started = true, cfg = {
    name = 0x14b2410 "internet", description = 0x0, name_list = {next = 0x14b2b80, prev = 0x14b2b80},
    apn_type_mask = 1, gtpu_mode = APN_GTPU_MODE_TUN, shutdown = false, tx_gpdu_seq = true}, tun = {cfg = {
      dev_name = 0x14b2980 "tun4", ipup_script = 0x0, ipdown_script = 0x0}, tun = 0x146d6e0, fd = {list = {
        next = 0x14b3010, prev = 0x7f7a19f3e950 <osmo_fds>}, fd = 5, when = 1, cb = 0x404cd0 <ggsn_tun_fd_cb>,
      data = 0x14b2b50, priv_nr = 0}}, v6_lladdr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>,
      __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, v4 = {cfg = {ifconfig_prefix = {
        addr = {len = 4 '\004', {v4 = {s_addr = 31330480}, v6 = {__in6_u = {
                __u6_addr8 = "\260\020\336\001", '\000' <repeats 11 times>, __u6_addr16 = {4272, 478, 0, 0, 0, 0,
                  0, 0}, __u6_addr32 = {31330480, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, static_prefix = {addr = {
          len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>,
                __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'},
      dynamic_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 14553264}, v6 = {__in6_u = {
                __u6_addr8 = "\260\020\336", '\000' <repeats 12 times>, __u6_addr16 = {4272, 222, 0, 0, 0, 0, 0,
                  0}, __u6_addr32 = {14553264, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, dns = {{len = 4 '\004', {
            v4 = {s_addr = 134744072}, v6 = {__in6_u = {__u6_addr8 = "\b\b\b\b", '\000' <repeats 11 times>,
                __u6_addr16 = {2056, 2056, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {134744072, 0, 0, 0}}}}}, {
          len = 4 '\004', {v4 = {s_addr = 67635208}, v6 = {__in6_u = {
                __u6_addr8 = "\b\b\b\004", '\000' <repeats 11 times>, __u6_addr16 = {2056, 1032, 0, 0, 0, 0, 0,
                  0}, __u6_addr32 = {67635208, 0, 0, 0}}}}}}}, pool = 0x14b2d80}, v6 = {cfg = {ifconfig_prefix = {
        addr = {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>,
                __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'},
      static_prefix = {addr = {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {
                __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0,
                  0, 0, 0}}}}}, prefixlen = 0 '\000'}, dynamic_prefix = {addr = {len = 0 '\000', {v4 = {
              s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0,
                  0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, dns = {{len = 0 '\000', {v4 = {
              s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0,
                  0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {
                __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0,
                  0, 0, 0}}}}}}}, pool = 0x0}}
(gdb) print apn->v4
$5 = {cfg = {ifconfig_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 31330480}, v6 = {__in6_u = {
              __u6_addr8 = "\260\020\336\001", '\000' <repeats 11 times>, __u6_addr16 = {4272, 478, 0, 0, 0, 0,
                0, 0}, __u6_addr32 = {31330480, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, static_prefix = {addr = {
        len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>,
              __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'},
    dynamic_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 14553264}, v6 = {__in6_u = {
              __u6_addr8 = "\260\020\336", '\000' <repeats 12 times>, __u6_addr16 = {4272, 222, 0, 0, 0, 0, 0,
                0}, __u6_addr32 = {14553264, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, dns = {{len = 4 '\004', {v4 = {
            s_addr = 134744072}, v6 = {__in6_u = {__u6_addr8 = "\b\b\b\b", '\000' <repeats 11 times>,
              __u6_addr16 = {2056, 2056, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {134744072, 0, 0, 0}}}}}, {
        len = 4 '\004', {v4 = {s_addr = 67635208}, v6 = {__in6_u = {
              __u6_addr8 = "\b\b\b\004", '\000' <repeats 11 times>, __u6_addr16 = {2056, 1032, 0, 0, 0, 0, 0, 0},
              __u6_addr32 = {67635208, 0, 0, 0}}}}}}}, pool = 0x14b2d80}
(gdb) print apn->v4.pool
$7 = (struct ippool_t *) 0x14b2d80

So, we have apn->v4.pool=0x14b2d80 and in the code we have the following call: pool_close_all_pdp(apn->v4.pool);
However, according to gdb:

(gdb) frame 1
#1  0x000000000040453f in pool_close_all_pdp (pool=0x7ffe630d9200) at ggsn.c:109

So I'm not sure how could that happen. Stack corruption?

Also, last lines of osmo-ggsn log before crashing:

20171212153454402 DGGSN <0002> ggsn.c:525 PDP(901700000009031:5): Processing create PDP context request for APN 'ims'
20171212153454402 DGGSN <0002> ggsn.c:635 PDP(901700000009031:5): APN doesn't support requested EUA / AF type
20171212153454403 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 5130900000007109
20171212153454403 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found
20171212153454484 DGGSN <0002> ggsn.c:702 PDP(901700000009031:6): Packet received on APN(inet46): forwarding to tun tun46
20171212153459202 DGGSN <0002> ggsn.c:342 PDP(901700000009031:6): Deleting PDP context
20171212153459202 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 6130900000007109
20171212153459202 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found
20171212153506145 DLGTP <000d> gtp.c:1522 gtp_create_pdp_ind: Before pdp_tidget
20171212153506145 DLGTP <000d> pdp.c:275 Begin pdp_tidget tid = 5130900000007109
20171212153506145 DLGTP <000d> pdp.c:283 Begin pdp_tidget. Not found
20171212153506145 DLGTP <000d> pdp.c:237 Begin pdp_tidset tid = 5130900000007109
20171212153506145 DLGTP <000d> pdp.c:246 End pdp_tidset
20171212153506145 DGGSN <0002> ggsn.c:525 PDP(901700000009031:5): Processing create PDP context request for APN 'inet46'
20171212153506146 DGGSN <0002> ggsn.c:625 PDP(901700000009031:5): Successful PDP Context Creation: APN=inet46(inet46), TEIC=1, IP=fde4:8dba:82e1:2002::
20171212153507420 DGGSN <0002> ggsn.c:702 PDP(901700000009031:5): Packet received on APN(inet46): forwarding to tun tun46
20171212153507566 DTUN <0001> ggsn.c:673 Received packet for APN(inet46) from tun tun46 with no PDP contex!!
20171212153511662 DTUN <0001> ggsn.c:673 Received packet for APN(inet6) from tun tun6 with no PDP contex!!
20171212153513120 DGGSN <0002> ggsn.c:342 PDP(901700000009031:5): Deleting PDP context
20171212153513121 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 5130900000007109
20171212153513121 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found
20171212153515861 DGGSN <0002> ggsn.c:782 signal 2 received
20171212153515861 DGGSN <0002> ggsn.c:786 SIGINT received, shutting down
20171212153515861 DGGSN <0002> ggsn.c:115 APN(internet): FORCED Stopping
20171212153515861 DGGSN <0002> ggsn.c:108 PDP(0000000000000000:0): Sending DELETE PDP CTX due to shutdown

last_run.tgz last_run.tgz 485 KB pespin, 12/12/2017 03:18 PM
core.osmo-ggsn.20927.tar.bzip core.osmo-ggsn.20927.tar.bzip 169 KB pespin, 12/12/2017 03:19 PM
02-shutdown-from-ggsn-via-vty-export.pcapng 02-shutdown-from-ggsn-via-vty-export.pcapng 11.8 KB osmith, 05/16/2019 08:22 AM

History

#1 Updated by laforge over 1 year ago

  • Assignee set to pespin

#2 Updated by laforge 6 months ago

  • Assignee changed from pespin to osmith

#3 Updated by pespin 6 months ago

Reminder: gtp_delete_context_req is involved in the backtrace. Since a few weeks ago gtp_delete_context_req is deprecated and we have gtp_delete_context_req2, we should probably move to that (it's used already in osmo-sgsn, see git log for more information). The crash could be related to same fact for which gtp_delete_context_req2 was required in osmo-sgsn.

#4 Updated by osmith 15 days ago

  • Status changed from New to In Progress

#5 Updated by osmith 12 days ago

Pau just wrote:

I cannot find any proof that it's still crashing
Still it probably makes sense to get rid of gtp_delete_context_req() use in osmo-ggsn and use gtp_delete_context_req2

So I'll work on that.

#6 Updated by pespin 12 days ago

I cannot find osmo-gsm-tester generating related core files anymore, and no segfault indications in dmesg, so it seems the issue is gone (during they last year).

Still it makes sense to get rid of gtp_delete_context_req() and use new gtp_delete_context_req2() instead.

Related:
osmo-sgsn 03dc773e081c608524a0907cb1d867f792156157
osmo-ggsn 8e8c7ef3c7c2822737e7b74d299d520f0f4d94e0

#7 Updated by osmith 12 days ago

Thanks for the pointers, patch submitted:

https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14011/

#8 Updated by osmith 10 days ago

The submitted patch modified osmo-ggsn to use gtp_delete_context_req2() instead of gtp_delete_context_req(), and should free up pdp in the cb_conf() callback.

However, it seems that cb_conf() never gets executed.

First I have tried to connect a smartphone to the sgsn, then close the data connection on the phone. But this will only send a Delete Context Request from SGSN to GGSN, not the other way around. It does not run the code that has been patched that way.

SGSN -> GGSN Delte PDP context req
SGSN <- GGSN Delte PDP context res: Cause: Request accepted (128)
SGSN: runs cb_conf(), frees up pdp context

After looking through the code, I found that it is possible to shutdown active connections from the GGSN from the vty, by running these commands after connecting a MS and having an open data connection:

enable
configure terminal
ggsn ggsn0
shutdown ggsn

(bring it up again with "no shutdown ggsn".)

It looks like that:

GGSN -> SGSN Delete PDP context req
GGSN <- SGSN Delete PDP context res: Cause: request accepted (128)
GGSN: does NOT run cb_conf()! :\

So we can't merge the patch as is, because it would never free up the pdp context.

pespin, do you have an idea? (Otherwise I'll work on something else now, not using that deprecated function seems not to be very critical, and I've put quite a lot of time into this already.)

#9 Updated by pespin 10 days ago

My guess is that due to shutting down the gsn, probably some structures are dropped and received messages don't follow expected paths in this case. But without logs or pcap files it's difficult to know.
Please mark the gerrit patch as WIP and either keep it for you and have a look at it later or assign it to me so I can look at it at some point.

#10 Updated by osmith 9 days ago

The gerrit patch is already marked as WIP, I'll assign it to you. Here is a pcap dump of running "shutdown ggsn" on the tty.

For recording that pcap file, I had modified the callback to log messages, when it would reach the pdp_freepdp() code path for debugging purpose (see below), but both log messages do not show up in the gsmtap log. (Tried the same thing with a gdb breakpoint just to make sure, it never hits the breakpoint.)

/* libgtp callback for confirmations */
static int cb_conf(int type, int cause, struct pdp_t *pdp, void *cbp)
{
    DEBUGP(DGGSN, "libgtp cb_conf(type=%d, cause=%d, pdp=%p, cbp=%p)\n",
        type, cause, pdp, cbp);    

    switch (type) {
    case GTP_DELETE_PDP_REQ:
        DEBUGP(DGGSN, "*** about to run pdp_freepdp! ***\n");
        return pdp_freepdp(pdp);
    }
    return 0;
}

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)