Feature #5983

closed

Exit if osmo-e1d goes down or is restarted

Added by keith 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Ericsson BTS
Target version: -
Start date: 03/29/2023
Due date:
% Done: 100%
Spec Reference:

Description

If something happens to osmo-e1d while osmo-bsc is running and connected, osmo-bsc will spam the log with

DLMI ERROR input/e1d.c:74 E1TS(0:1) handle_ts_sign_read read failed 0 (Broken pipe)

This causes high CPU usage in osmo-bsc, in any terminal it may be running in, and possibly also in systemd-journald, along with considerable disk access.

On one hardware type/configuration this appears to cause the network to become unresponsive, and I couldn't even stop osmo-bsc via ssh. That's not something to fix here, but at least osmo-bsc should probably exit in this case, as it will not recover when osmo-e1d comes back online.


Related issues

Related to osmo-e1d - Bug #4916: USB unplug / replug renders e1d unusable (Stalled, laforge, 12/18/2020)

Related to OsmoBSC - Feature #5586: Ericsson RBS could recover Unlocked state without osmo-bsc restart (New, 06/22/2022)
Actions #1

Updated by laforge 11 months ago

  • Assignee changed from keith to dexter

This is likely a somewhat unique situation (for libosmo-abis), as in DAHDI or mISDN the underlying device never disappears. I'm not sure how easy it is to fix.

One alternative would be to use the dahdi driver for the icE1usb. This has been used by a number of people (not with osmo-bsc, but in ISDN setups), so I'm rather confident about its usability.

However, the "cost" of that is that one needs to build https://gitea.osmocom.org/retronetworking/dahdi-linux - out-of-tree kernel modules were never fun :/

Maybe dexter can have a look at how complex it is to improve libosmo-abis e1d support in that regard (close TS devices when e1d disappears, try to re-connect periodically and re-open once it's back). I guess it would end up needing a related FSM taking care of those tasks.

Actions #2

Updated by laforge 11 months ago

  • Related to Bug #4916: USB unplug / replug renders e1d unusable added
Actions #3

Updated by keith 11 months ago

  • Related to Feature #5586: Ericsson RBS could recover Unlocked state without osmo-bsc restart added
Actions #4

Updated by keith 11 months ago

Given your comments, I'm adding a possibly related feature request.

When there was an Ericsson BTS in huautla, I had hacked a kind of loop detection into osmo-bsc there: it would exit after logging "discarding RSL message received in locked administrative state" several times, because osmo-bsc also never recovers from that state.
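The loop detection described above can be sketched roughly as follows. This is not the actual hack (which is not shown in this ticket), only a minimal illustration in Python. The class name `RepeatGuard`, the `limit` parameter and the choice to count consecutive repeats are all assumptions made for the example.

```python
class RepeatGuard:
    """Signal when the same message was seen `limit` times in a row.

    Hypothetical helper: the name and the "consecutive repeats" rule
    are assumptions, not taken from osmo-bsc.
    """

    def __init__(self, limit):
        self.limit = limit
        self.last = None
        self.count = 0

    def seen(self, message):
        """Record a message; return True once it has repeated `limit` times."""
        if message == self.last:
            self.count += 1
        else:
            # A different message resets the streak.
            self.last = message
            self.count = 1
        return self.count >= self.limit
```

A caller would feed each log line into `seen()` and exit the process when it returns True, leaving the restart to the service manager.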

Actions #5

Updated by dexter 11 months ago

  • Status changed from New to In Progress
Actions #6

Updated by dexter 11 months ago

I have had a look at it. The problem also exists in osmo-mgw since it uses the same API. I can use the E1 TTCN3 tests to reproduce it by killing osmo-e1d while the test is running.

I think this is fixable. There should be an FSM that tries to recover the connection to osmo-e1d. When the connection is lost it would block the API calls, and when the connection is recovered it would refresh the file descriptors and unblock the API calls again. This would still interrupt the connection, but the behavior should be more like an unplugged E1 cable rather than an unresponsive E1 controller card. (I am not sure if this alone will fix the CPU load issue; we might need some rate limiting here as well.)
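A minimal sketch of such an FSM, written in Python for brevity rather than against the libosmo-abis API. All names here (`ReconnectFsm`, `on_timer`, `on_connection_lost`, `api_call`, the injected `connect` callable) are hypothetical, and "blocking" an API call is modelled simply as refusing it with an error.

```python
import enum


class State(enum.Enum):
    DISCONNECTED = "disconnected"
    CONNECTED = "connected"


class ReconnectFsm:
    """Track a connection to a daemon and gate API calls on its state."""

    def __init__(self, connect):
        # `connect` is a hypothetical callable: it returns a connection
        # object with close(), or raises OSError while the daemon is down.
        self._connect = connect
        self.conn = None
        self.state = State.DISCONNECTED

    def on_timer(self):
        """Periodic retry; does nothing while connected."""
        if self.state is State.CONNECTED:
            return
        try:
            self.conn = self._connect()
        except OSError:
            return  # still down, try again on the next timer tick
        self.state = State.CONNECTED

    def on_connection_lost(self):
        """Close the stale handle so it is never polled again."""
        if self.conn is not None:
            self.conn.close()
            self.conn = None
        self.state = State.DISCONNECTED

    def api_call(self, func):
        """Run func(conn) only while connected; otherwise refuse."""
        if self.state is not State.CONNECTED:
            raise ConnectionError("daemon not connected")
        return func(self.conn)
```

The point of the design is that every user of the connection goes through one gate, so a lost connection looks like a temporarily unavailable line instead of a stream of read errors.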

Actions #7

Updated by dexter 11 months ago

  • % Done changed from 0 to 50

I have it roughly working now. I can now restart osmo-e1d while TC_e1_crcx_loopback is running. When osmo-e1d comes back, I see TRAU frames flowing again. Unfortunately it still hangs sometimes, possibly because the disconnect is not detected properly.

The endless-loop problem seems to go away when the bfds are properly closed and unregistered. What makes me wonder though is the fact that it goes so haywire when the file descriptor gets lost. e1d_fd_cb() seems to be called in an endless loop. Maybe that is something we can detect?

Actions #8

Updated by dexter 10 months ago

  • % Done changed from 50 to 80

I think I have fixed the problem now. I can now reliably restart osmo-e1d while TC_e1_crcx_loopback is running. It's also no problem when osmo-e1d starts late.

A patch is in gerrit now: https://gerrit.osmocom.org/c/libosmo-abis/+/32374 e1d: reconnect to osmo-e1d after connection loss

I have also pushed everything on a private branch in libosmo-abis.git: pmaier/e1d_reconnect - I think it would be good if you (keith) could give it a try on your side.

Actions #9

Updated by dexter 10 months ago

Unfortunately the CPU load issue is still present. The CPU load peaks when osmo-e1d is terminated while there is no load on osmo-mgw. The file descriptor used for traffic should not be the issue since osmo-mgw sets unused E1 timeslots to none, which should close those file descriptors. I have the feeling that there is something with the file descriptor used for control traffic.

Actions #10

Updated by dexter 10 months ago

I think I have found the problem. It was indeed the control socket that wasn't closed. When osmo-e1d dies the client goes haywire because the file descriptor is flooded with POLLHUP callbacks. The problem vanishes when the file descriptor is closed properly.

https://gerrit.osmocom.org/c/osmo-e1d/+/32388 proto_clnt: close osmo-e1d control socket on connection loss
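The effect described above can be reproduced outside of osmo-e1d with a few lines of Python. This is only an illustration of the general poll() behaviour, assuming a Unix-like system where `select.poll` is available; the function name `count_wakeups` and its parameters are made up for the example.

```python
import select
import socket


def count_wakeups(close_on_hangup, rounds=5):
    """Poll a socket whose peer has gone away and count the wakeups."""
    client, server = socket.socketpair()
    server.close()  # simulate the daemon dying

    poller = select.poll()
    fd = client.fileno()
    poller.register(fd, select.POLLIN)

    wakeups = 0
    registered = True
    for _ in range(rounds):
        # Timeout is in milliseconds; a hung-up descriptor makes poll()
        # return immediately on every call instead of waiting.
        if poller.poll(10):
            wakeups += 1
            if close_on_hangup and registered:
                # The fix: stop polling the dead descriptor and close it.
                poller.unregister(fd)
                client.close()
                registered = False
    if registered:
        client.close()
    return wakeups
```

With `close_on_hangup=False` every round wakes up at once, which is the busy loop seen in the client. With `close_on_hangup=True` only the first round reports an event, and later rounds simply time out.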

Actions #11

Updated by dexter 10 months ago

The problem is fixed for the osmo-e1d driver in libosmo-abis now. The patch-set for this is currently in review.

Actions #12

Updated by dexter 10 months ago

Most of the patches made it into master. What's left is:

https://gerrit.osmocom.org/c/libosmo-abis/+/32569 e1d: get rid of strange file descriptor registered check [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/32374 e1d: reconnect to osmo-e1d after connection loss

and one trivial patch in osmo-e1d.git: https://gerrit.osmocom.org/c/osmo-e1d/+/32393

Actions #13

Updated by dexter 10 months ago

  • % Done changed from 80 to 90

The following two patches are still in review. As far as I can see, there are no open review issues at the moment.

https://gerrit.osmocom.org/c/libosmo-abis/+/32569 e1d: get rid of strange file descriptor registered check
https://gerrit.osmocom.org/c/libosmo-abis/+/32374 e1d: reconnect to osmo-e1d after connection loss

Actions #14

Updated by dexter 9 months ago

The following patch is still in review, there are no open issues at the moment:
https://gerrit.osmocom.org/c/libosmo-abis/+/32374 e1d: reconnect to osmo-e1d after connection loss

Actions #15

Updated by dexter 9 months ago

The following patch is still in review. There are not enough review points yet but at least from my perspective I think that it is ready for merging.
https://gerrit.osmocom.org/c/libosmo-abis/+/32374 e1d: reconnect to osmo-e1d after connection loss

Actions #16

Updated by dexter 8 months ago

We discovered that there is a problem with incomplete read/write syscalls. It seems that this is not much of an issue in practice, since the problem existed before as well. I have added FIXME notes in the code.
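For context, handling incomplete read/write syscalls usually means looping until the full amount has been transferred. The sketch below shows the general technique in Python on blocking descriptors. It is not the code behind the FIXME notes, and the helper names `write_all` and `read_exact` are hypothetical. Non-blocking descriptors in an event loop would need buffering across callbacks instead.

```python
import os


def write_all(fd, data):
    """Write every byte of `data`, retrying after short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]  # drop what the kernel already accepted


def read_exact(fd, length):
    """Read exactly `length` bytes, or raise EOFError if the peer closes."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            raise EOFError("peer closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```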

Actions #17

Updated by dexter 8 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

The e1d changes made it into master now. We can close this ticket now.
