Bug #4181

osmo-trx-uhd: Crash during physical unplug of device

Added by pespin over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
08/29/2019
Due date:
% Done:

0%

Spec Reference:

Description

I was running my network using osmo-trx-uhd with an Ettus B200 and I unplugged the device. I got this:

Thu Aug 29 20:55:42 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 738780
Thu Aug 29 20:55:43 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 738997
Thu Aug 29 20:55:44 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 739213
terminate called after throwing an instance of 'uhd::io_error'
  what():  EnvironmentError: IOError: usb rx6 transfer status: LIBUSB_TRANSFER_NO_DEVICE
[ERROR] [UHD] signal 6 received
An unexpected exception was caught in a task loop.The task loop will now exit, things may not work.EnvironmentError: IOError: usb rx8 transfer status: LIBUSB_TRANSFER_NO_DEVICE
talloc report on 'OsmoTRX' (total   5246 bytes in  21 blocks)
    /home/pespin/dev/sysmocom/git/libosmocore/src/rate_ctr.c:234 contains    512 bytes in   1 blocks (ref 0) 0x6160000057e0
    /home/pespin/dev/sysmocom/git/osmo-trx/CommonLibs/trx_rate_ctr.cpp:276 contains      8 bytes in   1 blocks (ref 0) 0x60b0000c3130
    /home/pespin/dev/sysmocom/git/osmo-trx/CommonLibs/trx_rate_ctr.cpp:275 contains     32 bytes in   1 blocks (ref 0) 0x60c000023620
    telnet_connection              contains      1 bytes in   1 blocks (ref 0) 0x60b0000c2370
    logging                        contains   4303 bytes in  11 blocks (ref 0) 0x60b0000155a0
    struct trx_ctx                 contains    390 bytes in   4 blocks (ref 0) 0x6140000006a0
    msgb                           contains      0 bytes in   1 blocks (ref 0) 0x608000005f80
full talloc report on 'OsmoTRX' (total   5246 bytes in  21 blocks)
...
./run_out.sh: line 12: 28572 Aborted                 (core dumped) $@

(./run_out.sh is the bash script I use to launch osmo-trx-uhd).

So it seems we are not handling a UHD exception in UHDDevice, which ends up aborting the entire process. We should handle it and stop the osmo-trx-uhd process gracefully through the osmo signal available for that purpose.

Actions #2

Updated by pespin over 4 years ago

  • Status changed from New to Closed

As far as I understand, the UHD code is run in a separate thread created by UHD's task::make: https://files.ettus.com/manual/classuhd_1_1task.html

        task_handler = task::make(
            boost::bind(&libusb_session_impl::libusb_event_handler_task, this, _context));

As a result, we have no way to catch C++ exceptions thrown in that thread, and the uncaught C++ exception ends up calling abort(), which sends SIGABRT to the process ("signal 6 received").
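For illustration only: in standard C++, an exception that escapes a thread's entry function calls std::terminate(), and a try/catch in the thread that created it never sees it. The sketch below shows what a guard could look like if the thread body were under our control, which is not the case here since UHD creates the thread. The names run_guarded, shutdown_requested and failing_task are hypothetical and not part of osmo-trx or UHD.

```cpp
#include <atomic>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for a global shutdown flag such as gshutdown.
std::atomic<bool> shutdown_requested{false};

// Runs `body` in a new thread and catches every exception inside that thread.
// Returns true if the body threw.
bool run_guarded(const std::function<void()> &body)
{
    std::atomic<bool> threw{false};
    std::thread t([&] {
        try {
            body();
        } catch (...) {
            // Without this catch the exception would leave the thread
            // function and std::terminate() would be called.
            threw = true;
            shutdown_requested = true;
        }
    });
    t.join();
    return threw;
}

// Hypothetical task that fails the way an I/O error on unplug might.
void failing_task()
{
    throw std::runtime_error("usb transfer status: NO_DEVICE");
}
```

Calling run_guarded(failing_task) returns true and sets shutdown_requested instead of terminating the process. The catch has to live inside the thread function itself, which is exactly what we cannot add from the osmo-trx side.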

The current osmo-trx signal handler:

static void sig_handler(int signo)
{

    if (gshutdown)
        /* We are in the middle of shutdown process, avoid any kind of extra
           action like printing */
        return;

    fprintf(stderr, "signal %d received\n", signo);
    switch (signo) {
    case SIGINT:
    case SIGTERM:
        fprintf(stderr, "shutting down\n");
        gshutdown = true;
        break;
    case SIGABRT:
    case SIGUSR1:
        talloc_report(tall_trx_ctx, stderr);
        talloc_report_full(tall_trx_ctx, stderr);
        break;
    case SIGUSR2:
        talloc_report_full(tall_trx_ctx, stderr);
        break;
    case SIGHUP:
        log_targets_reopen();
    default:
        break;
    }

}

Unfortunately, there doesn't seem to be a way to handle this cleanly and shut down properly in this scenario (abort() called). According to POSIX:

The abort() function shall cause abnormal process termination to occur, unless the signal SIGABRT is being caught and the signal handler does not return.
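A minimal sketch of that POSIX rule, assuming a POSIX system with fork() and waitpid(): a child process installs a SIGABRT handler that simply returns (as the SIGABRT branch of sig_handler does) and then calls abort(). The helper names returning_abrt_handler and child_abort_signal are hypothetical.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// SIGABRT handler that returns normally after printing a short message.
static void returning_abrt_handler(int)
{
    // Only async-signal-safe calls belong here; write() is one of them.
    static const char msg[] = "SIGABRT caught, handler returning\n";
    ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)rc;
}

// Forks a child that installs the handler and calls abort().
// Returns the number of the signal that killed the child,
// 0 if the child exited normally, or -1 on error.
int child_abort_signal()
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        std::signal(SIGABRT, returning_abrt_handler);
        std::abort(); // handler runs and returns, abort() still terminates
        _exit(0);     // not reached
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

child_abort_signal() reports SIGABRT: once the handler returns, abort() still terminates the process, so catching the signal alone cannot turn an abort into a graceful shutdown.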

I don't see how we can keep the signal handler from returning (other than by stopping/cancelling the thread or making it block forever), especially since we use signalfd, so the one handling the abort signal is probably the main thread. It's probably better to let it continue so that a core dump is generated (not sure it makes much sense, though, since the other threads keep running anyway...).

So I'm closing the ticket.
