Bug #2560

osmo-bts-trx crash with sigabrt

Added by msuraev 14 days ago. Updated 8 days ago.

Status: In Progress
Priority: High
Assignee:
Category: -
Target version: -
Start date: 10/06/2017
Due date:
% Done: 50%
Spec Reference:

Description

It's been reported that osmo-bts-trx crashes under certain conditions on Ubuntu 16.04 x86_64.

Attached are a crash file (details can be extracted with apport-unpack), config files, and pcaps.

_usr_local_osmo-bts_src_osmo-bts-trx_osmo-bts-trx.0.crash (210 KB) msuraev, 10/06/2017 03:14 PM

openbsc_osmobsc.conf (6.22 KB) msuraev, 10/06/2017 03:16 PM

osmo-bts.cfg (624 Bytes) msuraev, 10/06/2017 03:16 PM

osmo-bts-trx core dumped.pcap (61.7 KB) msuraev, 10/06/2017 03:16 PM

osmo-bts-trx (1.5 MB) msuraev, 10/09/2017 02:44 PM

History

#1 Updated by msuraev 11 days ago

Backtrace:

Core was generated by `/usr/local/osmo-bts/src/osmo-bts-trx/osmo-bts-trx -c /root/osmocom_files/osmo-b'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
325     localealias.c: No such file or directory.
(gdb) bt
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
#1  0x0000000000413692 in down_fom (msg=0xbf7570, bts=0xc2e620) at oml.c:1128
#2  down_oml (bts=0xc2e620, msg=0xbf7570) at oml.c:1437
#3  0x000000000042532d in sign_link_cb (msg=<optimized out>) at abis.c:166
#4  0x00007fd6a97f2dc4 in ?? ()
#5  0x0000000000cba648 in ?? ()
...

#2 Updated by msuraev 11 days ago

#3 Updated by laforge 9 days ago

  • Priority changed from Normal to High

#4 Updated by msuraev 8 days ago

  • % Done changed from 0 to 50

The workaround in gerrit 4232 should prevent the crash. The cause, as Neels pointed out on the mailing list, is the OSMO_ASSERT(trx); in trx_phy_instance(). It is triggered in osmo-bts-trx (although I did not manage to reproduce it locally) when an attribute request arrives while the TRX is not yet available. I suspect this is due to a missing or in-progress connection to osmo-trx.
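
To make the failure mode concrete, here is a minimal sketch in C. It is not the osmo-bts code: `demo_trx`, `demo_phy_inst`, `demo_phy_instance_strict()` and `demo_phy_instance_checked()` are hypothetical stand-ins, and plain `assert()` stands in for `OSMO_ASSERT()`. The strict accessor aborts the process (SIGABRT) when it is called before the TRX exists; the checked variant leaves that case to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real osmo-bts structures. */
struct demo_phy_inst {
	int num;
};

struct demo_trx {
	struct demo_phy_inst *pinst; /* NULL until the PHY link is set up */
};

/* Strict accessor: aborts the process if the TRX is missing,
 * which is the behaviour that shows up as SIGABRT. */
struct demo_phy_inst *demo_phy_instance_strict(const struct demo_trx *trx)
{
	assert(trx);
	return trx->pinst;
}

/* Checked accessor: reports "not available" (NULL) instead of aborting. */
struct demo_phy_inst *demo_phy_instance_checked(const struct demo_trx *trx)
{
	if (!trx)
		return NULL;
	return trx->pinst;
}
```

With the checked variant, a request handler can test the result for NULL and answer with an error or a reduced reply rather than taking the whole process down.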

The right fix would be to reply only when the TRX is available. That would require either storing the request and properly hooking the responder into TRX initialization, or making sure the TRX is always available by delaying the osmo-bts-trx connection to the BSC until it is ready. I'm not sure either is worth pursuing at the moment.

Alternatively, the BSC could detect this situation and re-request the attributes later on (I'm not sure at which point, though).

The downside of the workaround in gerrit 4232 is that some TRX-specific attributes might not be reported to the BSC. So far the response is purely informational: the only thing we do with it is log it.
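
A minimal sketch of what such a reduced reply could look like, under stated assumptions: the attribute IDs, `demo_attr_resp` and `demo_build_attr_resp()` are hypothetical and reflect neither the actual OML encoding nor the patch in gerrit 4232. BTS-wide attributes are always reported; TRX-specific ones are listed as unsupported while the TRX is not ready, so the receiver can tell what is missing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute IDs, for illustration only. */
enum demo_attr {
	DEMO_ATTR_SW_VERSION,   /* BTS-wide, always known */
	DEMO_ATTR_TRX_PHY_INFO, /* needs a ready TRX */
};

#define DEMO_MAX_ATTR 8

struct demo_attr_resp {
	enum demo_attr reported[DEMO_MAX_ATTR];
	size_t num_reported;
	enum demo_attr unsupported[DEMO_MAX_ATTR];
	size_t num_unsupported;
};

static bool demo_attr_needs_trx(enum demo_attr a)
{
	return a == DEMO_ATTR_TRX_PHY_INFO;
}

/* Sort each requested attribute into "reported" or "unsupported".
 * At most DEMO_MAX_ATTR requests are processed.
 * Returns the number of attributes that could not be reported. */
size_t demo_build_attr_resp(struct demo_attr_resp *resp,
			    const enum demo_attr *req, size_t num_req,
			    bool trx_ready)
{
	size_t i;

	resp->num_reported = 0;
	resp->num_unsupported = 0;

	for (i = 0; i < num_req && i < DEMO_MAX_ATTR; i++) {
		if (demo_attr_needs_trx(req[i]) && !trx_ready)
			resp->unsupported[resp->num_unsupported++] = req[i];
		else
			resp->reported[resp->num_reported++] = req[i];
	}
	return resp->num_unsupported;
}
```

A non-zero return value is the signal a BSC-side consumer would need before relying on the attributes for anything beyond logging.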

#5 Updated by msuraev 8 days ago

  • Status changed from New to In Progress

#6 Updated by laforge 8 days ago

Hi Max,

On Thu, Oct 12, 2017 at 12:13:40PM +0000, msuraev [REDMINE] wrote:

> Workaround in gerrit 4232 should prevent the crash. The reason, as
> pointed out by Neels in ML, is the OSMO_ASSERT(trx); in
> trx_phy_instance(). This is triggered in osmo-bts-trx (although I did
> not manage to reproduce it locally) when attribute request arrives at
> the time when TRX is not yet available. I suspect this is due to
> missing/in-progress connection to osmo-trx.
>
> The right fix would be to only reply when TRX is available. But that
> would require either to store the request and properly plug responder
> into TRX init or to make sure that TRX is always available by delaying
> osmo-bts-trx connection to BSC until it's ready. Not sure if either is
> worth pursuing ATM.

There is an alternative: Simply reply with an error, or with an empty
response ("no attributes"). Crashing osmo-bts-trx is not a good way of
handling this.

> Alternatively, BSC can detect this situation and re-request attributes
> later on (not sure at which point though).

It could be a periodic timer with something like 3 tries, after which
point the OML connection is dropped.

> The downside of the workaround in gerrit 4232 is that some
> TRX-specific attributes might not be reported to BSC. So far it's
> purely informational: the only thing we do with the response is
> logging.

Yes, but that's just the status quo. We need this to work properly,
and/or fail gracefully in order to be able to use the attributes. Let's
not create a chicken-and-egg situation here, where in the future we'll
then say "well yes, ideally we could use the attributes, but then
they aren't reported reliably".

#7 Updated by msuraev 8 days ago

laforge wrote:

> There is an alternative: Simply reply with an error, or with an empty
> response ("no attributes").

That's what the patch in gerrit 4232 does.

> It could be a periodic timer with something like 3 tries, after which
> point the OML connection is dropped.

I'm not sure how reliable it would be: after the OML connection is dropped, osmo-bts will be restarted (by systemd for example), try to connect again and so on.
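
The retry scheme under discussion can be sketched as a small state machine. Everything here is hypothetical (`demo_attr_retry`, the action names, and the limit of 3 taken from the suggestion quoted above); a real implementation would call `demo_retry_on_timeout()` from a periodic timer. The sketch also makes the concern visible: once the tries are used up, the only remaining action is to drop the OML connection.

```c
#include <stdbool.h>

#define DEMO_MAX_TRIES 3 /* "something like 3 tries" */

enum demo_action {
	DEMO_ACT_NONE,     /* complete response received, nothing to do */
	DEMO_ACT_REQUEST,  /* (re-)send the attribute request */
	DEMO_ACT_DROP_OML, /* give up and drop the OML connection */
};

struct demo_attr_retry {
	unsigned int tries; /* requests sent so far */
	bool complete;      /* full attribute set received */
};

void demo_retry_init(struct demo_attr_retry *r)
{
	r->tries = 0;
	r->complete = false;
}

/* Call when a response arrives; pass false if attributes were missing. */
void demo_retry_on_response(struct demo_attr_retry *r, bool complete)
{
	if (complete)
		r->complete = true;
}

/* Call once at start and then on every timer expiry. */
enum demo_action demo_retry_on_timeout(struct demo_attr_retry *r)
{
	if (r->complete)
		return DEMO_ACT_NONE;
	if (r->tries >= DEMO_MAX_TRIES)
		return DEMO_ACT_DROP_OML;
	r->tries++;
	return DEMO_ACT_REQUEST;
}
```

Whether `DEMO_ACT_DROP_OML` helps depends on what happens after the reconnect: if the TRX is still unavailable when the BTS comes back, the same three requests and the same drop simply repeat.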

Anyway, to implement it properly I have to reproduce the crash first.