Bug #2560
osmo-bts-trx crash with sigabrt (closed)
Added by msuraev over 6 years ago. Updated almost 4 years ago.
90%
Description
It's been reported that osmo-bts-trx crashes under certain conditions on Ubuntu 16.04 x86_64.
Attached are a crash file (details can be extracted with apport-unpack), config files, and pcaps.
Files
_usr_local_osmo-bts_src_osmo-bts-trx_osmo-bts-trx.0.crash | 210 KB | msuraev, 10/06/2017 03:14 PM
openbsc_osmobsc.conf | 6.22 KB | msuraev, 10/06/2017 03:16 PM
osmo-bts.cfg | 624 Bytes | msuraev, 10/06/2017 03:16 PM
osmo-bts-trx core dumped.pcap | 61.7 KB | msuraev, 10/06/2017 03:16 PM
osmo-bts-trx | 1.5 MB | msuraev, 10/09/2017 02:44 PM
Updated by msuraev over 6 years ago
Backtrace:
Core was generated by `/usr/local/osmo-bts/src/osmo-bts-trx/osmo-bts-trx -c /root/osmocom_files/osmo-b'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
325     localealias.c: No such file or directory.
(gdb) bt
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
#1  0x0000000000413692 in down_fom (msg=0xbf7570, bts=0xc2e620) at oml.c:1128
#2  down_oml (bts=0xc2e620, msg=0xbf7570) at oml.c:1437
#3  0x000000000042532d in sign_link_cb (msg=<optimized out>) at abis.c:166
#4  0x00007fd6a97f2dc4 in ?? ()
#5  0x0000000000cba648 in ?? ()
...
Updated by msuraev over 6 years ago
- % Done changed from 0 to 50
The workaround in gerrit 4232 should prevent the crash. The reason, as pointed out by Neels on the ML, is the OSMO_ASSERT(trx); in trx_phy_instance(). It is triggered in osmo-bts-trx (although I did not manage to reproduce it locally) when an attribute request arrives while the TRX is not yet available. I suspect this is due to a missing or in-progress connection to osmo-trx.
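To illustrate the failure mode described above, here is a minimal sketch contrasting an asserting lookup with a guarded one. It is not the osmo-bts code and not the actual patch in gerrit 4232: the structs, `trx_lookup()`, `get_attr_strict()`, `get_attr_graceful()` and the result codes are all hypothetical, and the C library `assert()` stands in for `OSMO_ASSERT()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real osmo-bts structures. */
struct trx { int nr; };

struct bts {
    struct trx *trx[4]; /* a NULL slot models "TRX not (yet) available" */
};

enum attr_result { ATTR_OK = 0, ATTR_ERR_NO_TRX = -1 };

/* Lookup that may legitimately fail while the TRX is not available. */
struct trx *trx_lookup(const struct bts *bts, int nr)
{
    if (nr < 0 || nr >= 4)
        return NULL;
    return bts->trx[nr];
}

/* Asserting variant: aborts the whole process (SIGABRT) if the TRX is missing. */
int get_attr_strict(const struct bts *bts, int nr)
{
    struct trx *trx = trx_lookup(bts, nr);
    assert(trx);
    return trx->nr;
}

/* Guarded variant: reports an error to the caller instead of aborting. */
enum attr_result get_attr_graceful(const struct bts *bts, int nr, int *out)
{
    struct trx *trx = trx_lookup(bts, nr);
    if (!trx)
        return ATTR_ERR_NO_TRX;
    *out = trx->nr;
    return ATTR_OK;
}
```

With the guarded variant, a request for an unavailable TRX yields `ATTR_ERR_NO_TRX`, which the caller could turn into an error or an empty response. The same request through `get_attr_strict()` would terminate the process.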
The right fix would be to only reply when the TRX is available. But that would require either storing the request and properly plugging the responder into TRX init, or making sure that the TRX is always available by delaying the osmo-bts-trx connection to the BSC until it's ready. Not sure if either is worth pursuing ATM.
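The first option (store the request and answer it from TRX init) could look roughly like the sketch below. This is only an illustration under assumptions: `struct trx_state`, `on_attr_request()` and `on_trx_ready()` are hypothetical names, and sending a reply is modelled by incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 8

/* Hypothetical model of one TRX; not an osmo-bts structure. */
struct trx_state {
    bool ready;               /* set once TRX init has completed */
    int pending[MAX_PENDING]; /* IDs of attribute requests waiting for the TRX */
    int n_pending;
    int replies_sent;         /* stands in for actually sending a reply */
};

/* Returns 1 if answered immediately, 0 if queued, -1 if the queue is full. */
int on_attr_request(struct trx_state *s, int req_id)
{
    if (s->ready) {
        s->replies_sent++;
        return 1;
    }
    if (s->n_pending >= MAX_PENDING)
        return -1;
    s->pending[s->n_pending++] = req_id;
    return 0;
}

/* To be called once when TRX init completes; returns the number of
 * queued requests that were answered. */
int on_trx_ready(struct trx_state *s)
{
    int flushed = s->n_pending;

    assert(!s->ready); /* init is expected to complete only once */
    s->ready = true;
    s->replies_sent += flushed;
    s->n_pending = 0;
    return flushed;
}
```

The cost of this approach is the extra state: the queue needs a bound, and a request that never gets a ready TRX needs some timeout or error path, which the sketch leaves out.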
Alternatively, BSC can detect this situation and re-request attributes later on (not sure at which point though).
The downside of the workaround in gerrit 4232 is that some TRX-specific attributes might not be reported to BSC. So far it's purely informational: the only thing we do with the response is logging.
Updated by laforge over 6 years ago
Hi Max,
On Thu, Oct 12, 2017 at 12:13:40PM +0000, msuraev [REDMINE] wrote:
> Workaround in gerrit 4232 should prevent the crash. The reason, as
> pointed out by Neels in ML, is the OSMO_ASSERT(trx); in
> trx_phy_instance(). This is triggered in osmo-bts-trx (although I did
> not manage to reproduce it locally) when attribute request arrives at
> the time when TRX is not yet available. I suspect this is due to
> missing/in-progress connection to osmo-trx.
>
> The right fix would be to only reply when TRX is available. But that
> would require either to store the request and properly plug responder
> into TRX init or to make sure that TRX is always available by delaying
> osmo-bts-trx connection to BSC until it's ready. Not sure if either is
> worth pursuing ATM.

There is an alternative: Simply reply with an error, or with an empty
response ("no attributes"). Crashing osmo-bts-trx is not a good way of
handling this.

> Alternatively, BSC can detect this situation and re-request attributes
> later on (not sure at which point though).

It could be a periodic timer with something like 3 tries, after which
point the OML connection is dropped.

> The downside of the workaround in gerrit 4232 is that some
> TRX-specific attributes might not be reported to BSC. So far it's
> purely informational: the only thing we do with the response is
> logging.

Yes, but that's just the status quo. We need this to work properly,
and/or fail gracefully in order to be able to use the attributes. Let's
not create a chicken-and-egg situation here, where in the future we'll
then say "well yes, ideally we could use the attributes, but then
they aren't reported reliably".
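
The periodic re-request idea from this comment (a timer with something like 3 tries, then dropping the OML connection) can be sketched as a small decision function on the BSC side. Everything here is hypothetical: `attr_poll_step()`, `struct attr_poll` and the `oml_action` values are illustrative names, not OpenBSC or libosmocore APIs, and the timer itself is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ATTR_TRIES 3 /* "something like 3 tries" */

enum oml_action {
    OML_DONE,        /* attributes received, stop polling */
    OML_RETRY_LATER, /* re-arm the periodic timer and ask again */
    OML_DROP_LINK    /* give up and drop the OML connection */
};

/* Hypothetical per-BTS polling state kept by the BSC. */
struct attr_poll {
    int tries; /* attribute requests evaluated so far */
};

/* Called after each attribute request has been answered (or timed out).
 * got_attributes is true if the BTS returned a usable, non-empty reply. */
enum oml_action attr_poll_step(struct attr_poll *p, bool got_attributes)
{
    assert(p->tries < MAX_ATTR_TRIES); /* must not be called after giving up */
    p->tries++;

    if (got_attributes)
        return OML_DONE;
    if (p->tries >= MAX_ATTR_TRIES)
        return OML_DROP_LINK;
    return OML_RETRY_LATER;
}
```

Three empty replies in a row give `OML_RETRY_LATER`, `OML_RETRY_LATER`, `OML_DROP_LINK`. A usable reply at any step ends the polling with `OML_DONE`.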
Updated by msuraev over 6 years ago
laforge wrote:
> There is an alternative: Simply reply with an error, or with an empty
> response ("no attributes").

That's what the patch in gerrit 4232 does.

> It could be a periodic timer with something like 3 tries, after which
> point the OML connection is dropped.
I'm not sure how reliable it would be: after the OML connection is dropped, osmo-bts will be restarted (by systemd for example), try to connect again and so on.
Anyway, to implement it properly I have to reproduce the crash first.
Updated by msuraev over 6 years ago
- Status changed from In Progress to Stalled
- % Done changed from 50 to 60
Gerrit 4232 has been merged.
Updated by pespin over 4 years ago
- Status changed from Stalled to Feedback
- Assignee set to pespin
- % Done changed from 60 to 90
A missing TRX can have two causes:
- The BSC/BTS is not correctly configured and the BSC is asking for a TRX which is not allocated by the BTS.
- The TRX was not yet created and added to bts->trx_list during gsm_bts_trx_alloc().
The first case is a config issue, and the crash it caused is already fixed.
Second case: let's check if it's possible with current code base:
- For first TRX: during gsm_bts_alloc() (but initialized later during bts_init()). That's called in main.c really early on before OML is available, so we are safe here.
- For other TRX: during "trx <0-254>" cmd from VTY. Called during main.c vty_read_config_file(). OML conn is only started later in main.c during abis_open(), so we are safe here too.
So imho this is no longer an issue and the ticket can be closed. I'll close it later if nobody disagrees.
Updated by pespin almost 4 years ago
- Status changed from Feedback to Closed
Closing, since nobody disagreed during 6 months, and I never had this kind of crash while operating an osmo-bts-trx.