Bug #5523
sigsegv race condition destroying everything due to "reset_all_state"
Status: Closed
Done: 100%
Description
There seems to be a use-after-free when a tunnel is freed. It looks like a race condition between the "tun_device_thread" thread and the main thread.
It can be triggered by running PGW_Tests.TC_createSession_ping4_256. It doesn't always show a backtrace, though, probably depending on the state of the memory chunk after it is freed. In general osmo-uecups ends up in a strange hung state, which can be seen by running further tests afterwards: it never answers the "reset_all_state" command with "reset_all_state_res".
PGW_Tests.TC_createSession_ping4
20220411142037407 DTUN tun_device.c:391 ping251: Destroying
20220411142037407 DGT gtp_tunnel.c:136 ping252-Rd39cdfc4-T000003f2: Destroying
20220411142037407 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=3
20220411142037408 DTUN tun_device.c:391 ping252: Destroying
20220411142037408 DGT gtp_tunnel.c:136 ping253-Rf59cc758-T000003f6: Destroying
20220411142037409 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=2
20220411142037409 DTUN tun_device.c:391 ping253: Destroying
20220411142037409 DGT gtp_tunnel.c:136 ping254-R104d44b1-T000003fa: Destroying
20220411142037410 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=1
[Thread 0x7f2e5c510700 (LWP 778) exited]
20220411142037467 DTUN tun_device.c:391 ping254: Destroying
20220411142037467 DGT gtp_tunnel.c:136 ping255-R24acc174-T000003fe: Destroying
20220411142037467 DEP gtp_endpoint.c:183 172.18.18.20:2152: Destroying
20220411142037467 DTUN tun_device.c:391 ping255: Destroying
[Thread 0x7f2e7b54e700 (LWP 592) exited]
[Thread 0x7f2e5bd0f700 (LWP 779) exited]
Thread 257 "osmo-uecups-dae" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f2e5b50e700 (LWP 784)]
0x0000564ea59b2e19 in _gtp_tunnel_find_eua (tun=0x564ea5f670e0, sa=0x7f2e5b4fac20, proto=58 ':') at gtp_tunnel.c:125
125         llist_for_each_entry(t, &d->gtp_tunnels, list) {
#0  0x0000564ea59b2e19 in _gtp_tunnel_find_eua (tun=0x564ea5f670e0, sa=0x7f2e5b4fac20, proto=58 ':') at gtp_tunnel.c:125
#1  0x0000564ea59afe0e in tun_device_thread (arg=0x564ea5f670e0) at tun_device.c:166
#2  0x00007f2edb8cdea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f2edb692def in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) quit
A debugging session is active.
MAIN THREAD:
  cups_client_handle_json
    cups_client_handle_reset_all_state
      pthread_rwlock_wrlock
      foreach_tun { _gtp_tunnel_destroy }
        _gtp_endpoint_release
          _gtp_endpoint_destroy (ep->use_count is 0)
            pthread_cancel(ep->thread);
        _tun_device_release
          _tun_device_destroy (tun->use_count is 0)
            pthread_cancel(tun->thread);
            close(tun->fd);
      pthread_rwlock_unlock

Thread 257:
  tun_device_thread(data = tun)
    read(tun->fd)
    pthread_rwlock_rdlock
    _gtp_tunnel_find_eua(tun)   <-- accessing tun crashes
    pthread_rwlock_unlock
The problem here seems to be that the tun pointer passed to the thread is only partially protected by the rwlock. The read() can complete, then the main thread frees the tun object, and only afterwards does the worker thread dereference the now-stale tun pointer.
This happens because pthread_cancel() only stops the thread once it reaches a cancellation point.
Updated by pespin almost 2 years ago
man pthread_cancel:
A thread's cancellation type, determined by pthread_setcanceltype(3), may be either asynchronous or deferred (the default for new threads). Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). Deferred cancelability means that cancellation will be delayed until the thread next calls a function that is a cancellation point. A list of functions that are or may be cancellation points is provided in pthreads(7).
Updated by pespin almost 2 years ago
Probably calling pthread_join() in the main thread before freeing the tun object would be enough to avoid this situation.
Updated by pespin almost 2 years ago
- Status changed from New to Feedback
- % Done changed from 0 to 90
Fixes here:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/27740 GTPv2_Emulation: Increase reset_all_stats timeout
https://gerrit.osmocom.org/c/osmo-uecups/+/27739 Fix use-after-free by tun thread after tun obj destroyed
The required fix was more complex than I initially envisioned, since both threads access the same mutex (and the mutex functions are not cancellation points).
Updated by pespin over 1 year ago
- Status changed from Feedback to Resolved
- % Done changed from 90 to 100
Merged, closing.