Bug #5523

open

sigsegv race condition destroying everything due to "reset_all_state"

Added by pespin about 1 month ago. Updated about 1 month ago.

Status:
Feedback
Priority:
Normal
Assignee:
Target version:
-
Start date:
04/11/2022
Due date:
% Done:

90%

Spec Reference:

Description

There seems to be a use-after-free when a tunnel is freed. It looks like a race condition between the "tun_device_thread" thread and the main thread.

It can be triggered by running PGW_Tests.TC_createSession_ping4_256. It doesn't always produce a backtrace, though, probably depending on the state of the memory chunk after it has been freed. In general, osmo-uecups seems to end up hanging, which can be seen by running further tests afterwards: it never answers the "reset_all_state" with "reset_all_state_res".
PGW_Tests.TC_createSession_ping4

20220411142037407 DTUN tun_device.c:391 ping251: Destroying
20220411142037407 DGT gtp_tunnel.c:136 ping252-Rd39cdfc4-T000003f2: Destroying
20220411142037407 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=3
20220411142037408 DTUN tun_device.c:391 ping252: Destroying
20220411142037408 DGT gtp_tunnel.c:136 ping253-Rf59cc758-T000003f6: Destroying
20220411142037409 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=2
20220411142037409 DTUN tun_device.c:391 ping253: Destroying
20220411142037409 DGT gtp_tunnel.c:136 ping254-R104d44b1-T000003fa: Destroying
20220411142037410 DEP gtp_endpoint.c:229 172.18.18.20:2152: Release; new use_count=1
[Thread 0x7f2e5c510700 (LWP 778) exited]
20220411142037467 DTUN tun_device.c:391 ping254: Destroying
20220411142037467 DGT gtp_tunnel.c:136 ping255-R24acc174-T000003fe: Destroying
20220411142037467 DEP gtp_endpoint.c:183 172.18.18.20:2152: Destroying
20220411142037467 DTUN tun_device.c:391 ping255: Destroying
[Thread 0x7f2e7b54e700 (LWP 592) exited]
[Thread 0x7f2e5bd0f700 (LWP 779) exited]

Thread 257 "osmo-uecups-dae" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f2e5b50e700 (LWP 784)]
0x0000564ea59b2e19 in _gtp_tunnel_find_eua (tun=0x564ea5f670e0,
    sa=0x7f2e5b4fac20, proto=58 ':') at gtp_tunnel.c:125
125             llist_for_each_entry(t, &d->gtp_tunnels, list) {
#0  0x0000564ea59b2e19 in _gtp_tunnel_find_eua (tun=0x564ea5f670e0,
    sa=0x7f2e5b4fac20, proto=58 ':') at gtp_tunnel.c:125
#1  0x0000564ea59afe0e in tun_device_thread (arg=0x564ea5f670e0)
    at tun_device.c:166
#2  0x00007f2edb8cdea7 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f2edb692def in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) quit
A debugging session is active.

MAIN THREAD:
cups_client_handle_json
    cups_client_handle_reset_all_state
        pthread_rwlock_wrlock
        foreach_tun { _gtp_tunnel_destroy }
            _gtp_endpoint_release
                _gtp_endpoint_destroy (ep->use_count is 0)
                    pthread_cancel(ep->thread);
            _tun_device_release
                _tun_device_destroy (tun->use_count is 0)
                    pthread_cancel(tun->thread);
                    close(tun->fd);
        pthread_rwlock_unlock

Thread 257:
tun_device_thread(data = tun)
    read(tun->fd)
    pthread_rwlock_rdlock
        _gtp_tunnel_find_eua(tun) <-- accessing tun crashes
    pthread_rwlock_unlock

The problem here seems to be that the tun pointer passed to the thread is only partially protected by the rwlock. This means read() can complete normally, then the main thread frees the tun object, and then the worker thread accesses the stale tun pointer.

This probably happens because pthread_cancel() only stops the thread once it reaches a cancellation point (typically a blocking syscall).

Actions #1

Updated by pespin about 1 month ago

man pthread_cancel:

       A thread's cancellation type, determined by pthread_setcanceltype(3), may be either asynchronous or deferred (the
       default for new threads). Asynchronous cancelability means that the thread can be canceled at any time (usually
       immediately, but the system does not guarantee this). Deferred cancelability means that cancellation will be
       delayed until the thread next calls a function that is a cancellation point. A list of functions that are or may
       be cancellation points is provided in pthreads(7).

Actions #2

Updated by pespin about 1 month ago

Calling pthread_join() in the main thread before freeing the tun object would probably be enough to avoid this situation.
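
A rough sketch of that idea, not the actual osmo-uecups code: struct demo_tun and both function names are made up for illustration, and a pipe stands in for the tun fd. The worker blocks in read(), which is a cancellation point, and the main thread cancels it and then waits in pthread_join() before closing the fd and freeing the object.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the real tun device object. */
struct demo_tun {
	int fd_rd;              /* read end of a pipe, stands in for the tun fd */
	int fd_wr;              /* write end, kept open so read() blocks */
	pthread_t thread;
	unsigned long rx_count;
};

static void *demo_tun_thread(void *arg)
{
	struct demo_tun *tun = arg;
	char buf[64];

	for (;;) {
		/* read() is a cancellation point: a deferred cancel is acted on here */
		ssize_t rc = read(tun->fd_rd, buf, sizeof(buf));
		if (rc <= 0)
			break;
		/* tun is dereferenced after read() returns; this is only safe
		 * as long as nobody frees tun while the thread is still alive */
		tun->rx_count++;
	}
	return NULL;
}

/* Returns 0 if the worker was cancelled and joined before tun was freed. */
int demo_tun_create_destroy(void)
{
	struct demo_tun *tun = calloc(1, sizeof(*tun));
	int fds[2];
	void *res = NULL;

	if (!tun)
		return -1;
	if (pipe(fds) < 0) {
		free(tun);
		return -1;
	}
	tun->fd_rd = fds[0];
	tun->fd_wr = fds[1];
	if (pthread_create(&tun->thread, NULL, demo_tun_thread, tun) != 0) {
		close(fds[0]);
		close(fds[1]);
		free(tun);
		return -1;
	}

	/* Request cancellation, then wait until the thread is really gone. */
	pthread_cancel(tun->thread);
	if (pthread_join(tun->thread, &res) != 0)
		return -1;

	/* Only now is it safe to release what the thread was using. */
	close(tun->fd_rd);
	close(tun->fd_wr);
	free(tun);
	return res == PTHREAD_CANCELED ? 0 : -1;
}
```

Calling demo_tun_create_destroy() from a test program should return 0. Without the pthread_join(), the free() could run while the worker is still between read() and its next cancellation point.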

Actions #3

Updated by pespin about 1 month ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 90

Fixes here:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/27740 GTPv2_Emulation: Increase reset_all_stats timeout
https://gerrit.osmocom.org/c/osmo-uecups/+/27739 Fix use-after-free by tun thread after tun obj destroyed

The required fix was more complex than I initially envisioned, since both threads are accessing the same mutex (and the mutex functions are not cancellation points).
