Bug #2741
closedosmo-ggsn segfaults after receiving SIGINT
100%
Description
while running osmo-ggsn in osmo-gsm-tester, at the time osmo-gsm-tester wants to finish the test it sends a SIGINT to stop osmo-ggsn.
After the SIGINT is sent, a segfautl can be sen in dmesg:
[1887221.478570] osmo-ggsn[20927]: segfault at b0 ip 00007f7a1a7d29a3 sp 00007ffe630d91f0 error 6 in libgtp.so.2.0.0[7f7a1a7ca000+11000]
The attached core file shows the following bt
Program terminated with signal SIGSEGV, Segmentation fault. (gdb) bt #0 gtp_delete_context_req (gsn=0x0, pdp=0x7f7a1a9dd5c0 <pdpa>, cbp=0x4, teardown=1) at gtp.c:2309 #1 0x000000000040453f in pool_close_all_pdp (pool=0x7ffe630d9200) at ggsn.c:109 #2 0x0000000000405b66 in apn_stop (apn=0x14b2b50, force=false, force@entry=true) at ggsn.c:117 #3 0x0000000000406635 in ggsn_stop (ggsn=ggsn@entry=0x14b2690) at ggsn.c:868 #4 0x00000000004028d2 in ggsn_stop (ggsn=0x14b2690) at ggsn.c:863 #5 main (argc=3, argv=<optimized out>) at ggsn.c:1002
It seems the pointer passed as parameter to pool_close_all_pdp is corrupted somehow, because it's not the one in the structure:
(gdb) frame 2 #2 0x0000000000405b66 in apn_stop (apn=0x14b2b50, force=false, force@entry=true) at ggsn.c:117 (gdb) print apn $3 = (struct apn_ctx *) 0x14b2b50 (gdb) print *apn $4 = {list = {next = 0x14b2fa0, prev = 0x14b26a0}, ggsn = 0x14b2690, started = true, cfg = { name = 0x14b2410 "internet", description = 0x0, name_list = {next = 0x14b2b80, prev = 0x14b2b80}, apn_type_mask = 1, gtpu_mode = APN_GTPU_MODE_TUN, shutdown = false, tx_gpdu_seq = true}, tun = {cfg = { dev_name = 0x14b2980 "tun4", ipup_script = 0x0, ipdown_script = 0x0}, tun = 0x146d6e0, fd = {list = { next = 0x14b3010, prev = 0x7f7a19f3e950 <osmo_fds>}, fd = 5, when = 1, cb = 0x404cd0 <ggsn_tun_fd_cb>, data = 0x14b2b50, priv_nr = 0}}, v6_lladdr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, v4 = {cfg = {ifconfig_prefix = { addr = {len = 4 '\004', {v4 = {s_addr = 31330480}, v6 = {__in6_u = { __u6_addr8 = "\260\020\336\001", '\000' <repeats 11 times>, __u6_addr16 = {4272, 478, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {31330480, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, static_prefix = {addr = { len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, dynamic_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 14553264}, v6 = {__in6_u = { __u6_addr8 = "\260\020\336", '\000' <repeats 12 times>, __u6_addr16 = {4272, 222, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {14553264, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, dns = {{len = 4 '\004', { v4 = {s_addr = 134744072}, v6 = {__in6_u = {__u6_addr8 = "\b\b\b\b", '\000' <repeats 11 times>, __u6_addr16 = {2056, 2056, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {134744072, 0, 0, 0}}}}}, { len = 4 '\004', {v4 = {s_addr = 67635208}, v6 = {__in6_u = { __u6_addr8 = "\b\b\b\004", '\000' <repeats 11 times>, __u6_addr16 = {2056, 1032, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {67635208, 0, 0, 0}}}}}}}, pool = 0x14b2d80}, v6 = {cfg = {ifconfig_prefix = { addr = {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, static_prefix = {addr = {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = { __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, dynamic_prefix = {addr = {len = 0 '\000', {v4 = { s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, dns = {{len = 0 '\000', {v4 = { s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, {len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = { __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}}}, pool = 0x0}} (gdb) print apn->v4 $5 = {cfg = {ifconfig_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 31330480}, v6 = {__in6_u = { __u6_addr8 = "\260\020\336\001", '\000' <repeats 11 times>, __u6_addr16 = {4272, 478, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {31330480, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, static_prefix = {addr = { len = 0 '\000', {v4 = {s_addr = 0}, v6 = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, prefixlen = 0 '\000'}, dynamic_prefix = {addr = {len = 4 '\004', {v4 = {s_addr = 14553264}, v6 = {__in6_u = { __u6_addr8 = "\260\020\336", '\000' <repeats 12 times>, __u6_addr16 = {4272, 222, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {14553264, 0, 0, 0}}}}}, prefixlen = 24 '\030'}, dns = {{len = 4 '\004', {v4 = { s_addr = 134744072}, v6 = {__in6_u = {__u6_addr8 = "\b\b\b\b", '\000' <repeats 11 times>, __u6_addr16 = {2056, 2056, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {134744072, 0, 0, 0}}}}}, { len = 4 '\004', {v4 = {s_addr = 67635208}, v6 = {__in6_u = { __u6_addr8 = "\b\b\b\004", '\000' <repeats 11 times>, __u6_addr16 = {2056, 1032, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {67635208, 0, 0, 0}}}}}}}, pool = 0x14b2d80} (gdb) print apn->v4.pool $7 = (struct ippool_t *) 0x14b2d80
So, we have apn->v4.pool=0x14b2d80 and in the code we have the following call: pool_close_all_pdp(apn->v4.pool);
However, according to gdb:
(gdb) frame 1 #1 0x000000000040453f in pool_close_all_pdp (pool=0x7ffe630d9200) at ggsn.c:109
So I'm not sure how could that happen. Stack corruption?
Also, last lines of osmo-ggsn log before crashing:
20171212153454402 DGGSN <0002> ggsn.c:525 PDP(901700000009031:5): Processing create PDP context request for APN 'ims' 20171212153454402 DGGSN <0002> ggsn.c:635 PDP(901700000009031:5): APN doesn't support requested EUA / AF type 20171212153454403 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 5130900000007109 20171212153454403 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found 20171212153454484 DGGSN <0002> ggsn.c:702 PDP(901700000009031:6): Packet received on APN(inet46): forwarding to tun tun46 20171212153459202 DGGSN <0002> ggsn.c:342 PDP(901700000009031:6): Deleting PDP context 20171212153459202 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 6130900000007109 20171212153459202 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found 20171212153506145 DLGTP <000d> gtp.c:1522 gtp_create_pdp_ind: Before pdp_tidget 20171212153506145 DLGTP <000d> pdp.c:275 Begin pdp_tidget tid = 5130900000007109 20171212153506145 DLGTP <000d> pdp.c:283 Begin pdp_tidget. Not found 20171212153506145 DLGTP <000d> pdp.c:237 Begin pdp_tidset tid = 5130900000007109 20171212153506145 DLGTP <000d> pdp.c:246 End pdp_tidset 20171212153506145 DGGSN <0002> ggsn.c:525 PDP(901700000009031:5): Processing create PDP context request for APN 'inet46' 20171212153506146 DGGSN <0002> ggsn.c:625 PDP(901700000009031:5): Successful PDP Context Creation: APN=inet46(inet46), TEIC=1, IP=fde4:8dba:82e1:2002:: 20171212153507420 DGGSN <0002> ggsn.c:702 PDP(901700000009031:5): Packet received on APN(inet46): forwarding to tun tun46 20171212153507566 DTUN <0001> ggsn.c:673 Received packet for APN(inet46) from tun tun46 with no PDP contex!! 20171212153511662 DTUN <0001> ggsn.c:673 Received packet for APN(inet6) from tun tun6 with no PDP contex!! 20171212153513120 DGGSN <0002> ggsn.c:342 PDP(901700000009031:5): Deleting PDP context 20171212153513121 DLGTP <000d> pdp.c:255 Begin pdp_tiddel tid = 5130900000007109 20171212153513121 DLGTP <000d> pdp.c:262 End pdp_tiddel: PDP found 20171212153515861 DGGSN <0002> ggsn.c:782 signal 2 received 20171212153515861 DGGSN <0002> ggsn.c:786 SIGINT received, shutting down 20171212153515861 DGGSN <0002> ggsn.c:115 APN(internet): FORCED Stopping 20171212153515861 DGGSN <0002> ggsn.c:108 PDP(0000000000000000:0): Sending DELETE PDP CTX due to shutdown
Files
Updated by pespin over 5 years ago
Reminder: gtp_delete_context_req is involved in the backtrace. Since a few weeks ago gtp_delete_context_req is deprecated and we have gtp_delete_context_req2, we should probably move to that (it's used already in osmo-sgsn, see git log for more information). The crash could be related to same fact for which gtp_delete_context_req2 was required in osmo-sgsn.
Updated by osmith almost 5 years ago
Pau just wrote:
I cannot find any proof that it's still crashing
Still it probably makes sense to get rid of gtp_delete_context_req() use in osmo-ggsn and use gtp_delete_context_req2
So I'll work on that.
Updated by pespin almost 5 years ago
I cannot find osmo-gsm-tester generating related core files anymore, and no segfault indications in dmesg, so it seems the issue is gone (during they last year).
Still it makes sense to get rid of gtp_delete_context_req() and use new gtp_delete_context_req2() instead.
Related:
osmo-sgsn 03dc773e081c608524a0907cb1d867f792156157
osmo-ggsn 8e8c7ef3c7c2822737e7b74d299d520f0f4d94e0
Updated by osmith almost 5 years ago
The submitted patch modified osmo-ggsn to use gtp_delete_context_req2() instead of gtp_delete_context_req(), and should free up pdp in the cb_conf() callback.
However, it seems that cb_conf() never gets executed.
First I have tried to connect a smartphone to the sgsn, then close the data connection on the phone. But this will only send a Delete Context Request from SGSN to GGSN, not the other way around. It does not run the code that has been patched that way.
SGSN -> GGSN Delte PDP context req SGSN <- GGSN Delte PDP context res: Cause: Request accepted (128) SGSN: runs cb_conf(), frees up pdp context
After looking through the code, I found that it is possible to shutdown active connections from the GGSN from the vty, by running these commands after connecting a MS and having an open data connection:
enable configure terminal ggsn ggsn0 shutdown ggsn
(bring it up again with "no shutdown ggsn".)
It looks like that:
GGSN -> SGSN Delete PDP context req GGSN <- SGSN Delete PDP context res: Cause: request accepted (128) GGSN: does NOT run cb_conf()! :\
So we can't merge the patch as is, because it would never free up the pdp context.
pespin, do you have an idea? (Otherwise I'll work on something else now, not using that deprecated function seems not to be very critical, and I've put quite a lot of time into this already.)
Updated by pespin almost 5 years ago
My guess is that due to shutting down the gsn, probably some structures are dropped and received messages don't follow expected paths in this case. But without logs or pcap files it's difficult to know.
Please mark the gerrit patch as WIP and either keep it for you and have a look at it later or assign it to me so I can look at it at some point.
Updated by osmith almost 5 years ago
The gerrit patch is already marked as WIP, I'll assign it to you. Here is a pcap dump of running "shutdown ggsn" on the tty.
For recording that pcap file, I had modified the callback to log messages, when it would reach the pdp_freepdp() code path for debugging purpose (see below), but both log messages do not show up in the gsmtap log. (Tried the same thing with a gdb breakpoint just to make sure, it never hits the breakpoint.)
/* libgtp callback for confirmations */ static int cb_conf(int type, int cause, struct pdp_t *pdp, void *cbp) { DEBUGP(DGGSN, "libgtp cb_conf(type=%d, cause=%d, pdp=%p, cbp=%p)\n", type, cause, pdp, cbp); switch (type) { case GTP_DELETE_PDP_REQ: DEBUGP(DGGSN, "*** about to run pdp_freepdp! ***\n"); return pdp_freepdp(pdp); } return 0; }
Updated by pespin almost 5 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Submitted bunch of patches documenting lifecycle, code logic and making it more robust:
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14282 gtp: Document spec reasoning drop of Rx DeleteCtxReq
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14283 pdp: Introduce new API pdp_count_secondary
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14284 gtp_create_pdp_ind: simplify code by reordering and compacting parsing
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14285 gtp: Refactor code to use gtp_freepdp(_teardown) APIs
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14286 gtp: Document free pdp ctx in non-teardown scenario
remote: https://gerrit.osmocom.org/#/c/osmo-ggsn/+/14011 ggsn: Use gtp_delete_context_req2() everywhere
They were run fine against TTCN ggsn and sgsn tests, and tested locally with 2 phones. They are merged since a few days ago, closing the ticket.