Bug #2560

osmo-bts-trx crash with sigabrt

Added by msuraev 10 months ago. Updated 2 months ago.

Status: Stalled
Priority: Low
Assignee:
Category: -
Target version: -
Start date: 10/06/2017
Due date:
% Done: 60%
Estimated time:
Spec Reference:

Description

It's been reported that osmo-bts-trx crashes under certain conditions on Ubuntu 16.04 x86_64.

Attached are a crash file (details can be extracted with apport-unpack), config files and a pcap.

_usr_local_osmo-bts_src_osmo-bts-trx_osmo-bts-trx.0.crash 210 KB msuraev, 10/06/2017 03:14 PM
openbsc_osmobsc.conf 6.22 KB msuraev, 10/06/2017 03:16 PM
osmo-bts.cfg 624 Bytes msuraev, 10/06/2017 03:16 PM
osmo-bts-trx core dumped.pcap 61.7 KB msuraev, 10/06/2017 03:16 PM
osmo-bts-trx 1.5 MB msuraev, 10/09/2017 02:44 PM

History

#1 Updated by msuraev 9 months ago

Backtrace:

Core was generated by `/usr/local/osmo-bts/src/osmo-bts-trx/osmo-bts-trx -c /root/osmocom_files/osmo-b'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
325     localealias.c: No such file or directory.
(gdb) bt
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
#1  0x0000000000413692 in down_fom (msg=0xbf7570, bts=0xc2e620) at oml.c:1128
#2  down_oml (bts=0xc2e620, msg=0xbf7570) at oml.c:1437
#3  0x000000000042532d in sign_link_cb (msg=<optimized out>) at abis.c:166
#4  0x00007fd6a97f2dc4 in ?? ()
#5  0x0000000000cba648 in ?? ()
...

#2 Updated by msuraev 9 months ago

#3 Updated by laforge 9 months ago

  • Priority changed from Normal to High

#4 Updated by msuraev 9 months ago

  • % Done changed from 0 to 50

The workaround in gerrit 4232 should prevent the crash. The cause, as Neels pointed out on the ML, is the OSMO_ASSERT(trx); in trx_phy_instance(). It is triggered in osmo-bts-trx (although I did not manage to reproduce it locally) when an attribute request arrives while the TRX is not yet available. I suspect this is due to a missing or in-progress connection to osmo-trx.

The right fix would be to reply only when the TRX is available. But that would require either storing the request and properly plugging the responder into TRX init, or making sure the TRX is always available by delaying the osmo-bts-trx connection to the BSC until it's ready. Not sure if either is worth pursuing ATM.

Alternatively, the BSC can detect this situation and re-request the attributes later on (not sure at which point though).

The downside of the workaround in gerrit 4232 is that some TRX-specific attributes might not be reported to the BSC. So far they are purely informational: the only thing we do with the response is log it.
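
To illustrate the failure mode and the kind of guard discussed above, here is a minimal standalone sketch. It is not the change merged in gerrit 4232 (that patch is not reproduced in this ticket), and it does not use the real osmo-bts structures: `struct trx`, `struct phy_instance`, `phy_instance_strict()`, `phy_instance_or_null()` and `build_attr_reply()` are hypothetical stand-ins, and the plain `assert()` only plays the role of OSMO_ASSERT(trx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real osmo-bts structures. */
struct phy_instance { int id; };
struct trx {
	struct phy_instance *pinst;
	const char *sw_version;
};

/* Crashing pattern: the process aborts (SIGABRT) if the TRX is missing. */
struct phy_instance *phy_instance_strict(const struct trx *trx)
{
	assert(trx); /* plays the role of OSMO_ASSERT(trx) */
	return trx->pinst;
}

/* Guarded pattern: tell the caller that nothing is available yet. */
struct phy_instance *phy_instance_or_null(const struct trx *trx)
{
	return trx ? trx->pinst : NULL;
}

/* Fill buf with a reply for an attribute request.
 * Returns the number of attributes reported; 0 means an empty reply
 * because the TRX (or its PHY instance) is not available yet. */
int build_attr_reply(const struct trx *trx, char *buf, size_t len)
{
	if (!phy_instance_or_null(trx)) {
		if (len > 0)
			buf[0] = '\0'; /* empty "no attributes" reply */
		return 0;
	}
	snprintf(buf, len, "sw_version=%s", trx->sw_version);
	return 1;
}
```

With the guarded lookup, a request that arrives before the TRX exists yields an empty reply rather than an abort, which is also where the downside comes from: the TRX-specific attribute is simply missing from that reply.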

#5 Updated by msuraev 9 months ago

  • Status changed from New to In Progress

#6 Updated by laforge 9 months ago

Hi Max,

On Thu, Oct 12, 2017 at 12:13:40PM +0000, msuraev [REDMINE] wrote:

> The workaround in gerrit 4232 should prevent the crash. The cause, as
> Neels pointed out on the ML, is the OSMO_ASSERT(trx); in
> trx_phy_instance(). It is triggered in osmo-bts-trx (although I did
> not manage to reproduce it locally) when an attribute request arrives
> while the TRX is not yet available. I suspect this is due to a
> missing or in-progress connection to osmo-trx.
>
> The right fix would be to reply only when the TRX is available. But
> that would require either storing the request and properly plugging
> the responder into TRX init, or making sure the TRX is always
> available by delaying the osmo-bts-trx connection to the BSC until
> it's ready. Not sure if either is worth pursuing ATM.

There is an alternative: Simply reply with an error, or with an empty
response ("no attributes"). Crashing osmo-bts-trx is not a good way of
handling this.

> Alternatively, the BSC can detect this situation and re-request the
> attributes later on (not sure at which point though).

It could be a periodic timer with something like 3 tries, after which
point the OML connection is dropped.

> The downside of the workaround in gerrit 4232 is that some
> TRX-specific attributes might not be reported to the BSC. So far they
> are purely informational: the only thing we do with the response is
> log it.

Yes, but that's just the status quo. We need this to work properly,
and/or fail gracefully in order to be able to use the attributes. Let's
not create a chicken-and-egg situation here, where in the future we'll
then say "well yes, ideally we could use the attributes, but then
they aren't reported reliably".
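
The bounded re-request idea mentioned above (a periodic timer with something like 3 tries, after which the OML connection is dropped) could look roughly like the following on the BSC side. This is only a sketch of the decision logic under assumed names: `attr_req_state`, `attr_req_step()`, the `ATTR_REQ_*` values and the limit constant are hypothetical and do not exist in osmo-bsc or libosmocore. The actual timer and OML handling are left to the caller.

```c
#include <stdbool.h>

#define ATTR_REQ_MAX_TRIES 3 /* "something like 3 tries" */

enum attr_req_action {
	ATTR_REQ_DONE,     /* attributes received: stop the timer */
	ATTR_REQ_SEND,     /* (re-)send the request and re-arm the timer */
	ATTR_REQ_DROP_OML, /* give up: drop the OML connection */
};

struct attr_req_state {
	unsigned int tries; /* number of requests sent so far */
};

/* Decide what to do on each timer expiry.
 * got_attributes tells whether a usable reply has arrived since the
 * previous call. */
enum attr_req_action attr_req_step(struct attr_req_state *st,
				   bool got_attributes)
{
	if (got_attributes)
		return ATTR_REQ_DONE;
	if (st->tries >= ATTR_REQ_MAX_TRIES)
		return ATTR_REQ_DROP_OML;
	st->tries++;
	return ATTR_REQ_SEND;
}
```

Starting from a zeroed state, the caller sends up to three requests and drops the link on the fourth expiry without a usable reply.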

#7 Updated by msuraev 9 months ago

laforge wrote:

> There is an alternative: Simply reply with an error, or with an empty
> response ("no attributes").

That's what the patch in gerrit 4232 does.

> It could be a periodic timer with something like 3 tries, after which
> point the OML connection is dropped.

I'm not sure how reliable it would be: after the OML connection is dropped, osmo-bts will be restarted (by systemd for example), try to connect again and so on.

Anyway, to implement it properly I have to reproduce the crash first.

#8 Updated by msuraev 9 months ago

  • Status changed from In Progress to Stalled
  • % Done changed from 50 to 60

Gerrit 4232 has been merged.

#9 Updated by laforge 5 months ago

  • Assignee deleted (msuraev)

#10 Updated by laforge 3 months ago

  • Assignee set to lynxis

#11 Updated by laforge 2 months ago

  • Priority changed from High to Low
