Bug #1761

LAPD: segfault when bootstrapping Nokia InSite

Added by laforge about 4 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Nokia BTS
Target version: -
Start date: 07/03/2016
Due date:
% Done: 100%
Spec Reference:

Description

When bootstrapping a Nokia InSite BTS, current OsmoNITB segfaults.

The reason for this is as follows:

  • ABM is established.
  • LAPD code hands an I frame to the application using send_dl_l3()
  • user application decides to call lapd_sap_stop() resulting in a local RELEASE request to LAPD
  • LAPD clears the transmit history and changes to IDLE state
  • application returns from processing the I frame
  • code proceeds in lapd_rx_i() and tries to transmit an I frame, as it didn't realize the state has meanwhile changed
  • lapd_send_i() tries to use dl->tx_hist -> boom.

As this is the second bug related to accessing a freed tx_hist, the code seems to require a more thorough audit.


Related issues

Related to libosmocore - Bug #1760: LAPD: segfault in T200 call-back (Closed, 07/03/2016)

Related to libosmocore - Bug #1762: Review LAPD code for race conditions regarding state, particularly in RELEASE (New, 07/03/2016)

Related to OsmoBSC - Bug #3975: osmo-bsc crash during startup with nokia insite (Closed, 05/04/2019)

Related to libosmocore - Bug #4646: SEGV when bringing up Nokia InSite (Resolved, 07/04/2020)

Related to libosmocore - Bug #1982: LAPD: segfault in lapd_est_req function (Resolved, 03/14/2017)

History

#1 Updated by laforge about 4 years ago

  • Related to Bug #1760: LAPD: segfault in T200 call-back added

#2 Updated by laforge about 4 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

The quick fix for this specific bug is to check for LAPD_STATE_MF_EST in the first lines of lapd_send_i(), and return if the link is not in that state. Not sure how many other similar bugs are still hidden :/
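To illustrate the idea of such a state guard, here is a minimal sketch in C. It is a toy model only: all toy_* names, the struct layout and the return codes are assumptions made up for this example and are not the real libosmocore data structures or functions. It models the sequence from the description (L3 callback releases the link, then the receive path tries to send) and shows how an early state check avoids touching the freed history.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of a LAPD datalink; NOT the real libosmocore structures. */
enum toy_lapd_state { TOY_STATE_IDLE, TOY_STATE_MF_EST };

struct toy_datalink {
	enum toy_lapd_state state;
	int *tx_hist;        /* transmit history, freed on release */
	size_t tx_hist_len;
	size_t frames_sent;
};

/* Establish multi-frame operation and allocate the transmit history. */
static int toy_establish(struct toy_datalink *dl, size_t window)
{
	assert(dl != NULL);
	if (window == 0)
		return -1;
	dl->tx_hist = calloc(window, sizeof(*dl->tx_hist));
	if (!dl->tx_hist)
		return -1;
	dl->tx_hist_len = window;
	dl->frames_sent = 0;
	dl->state = TOY_STATE_MF_EST;
	return 0;
}

/* Local RELEASE: clear the transmit history and go back to IDLE. */
static void toy_release(struct toy_datalink *dl)
{
	free(dl->tx_hist);
	dl->tx_hist = NULL;
	dl->tx_hist_len = 0;
	dl->state = TOY_STATE_IDLE;
}

/* Guarded send: return early unless the link is still established. */
static int toy_send_i(struct toy_datalink *dl, int payload)
{
	if (dl->state != TOY_STATE_MF_EST)
		return -1; /* link was released meanwhile, tx_hist is gone */
	dl->tx_hist[dl->frames_sent % dl->tx_hist_len] = payload;
	dl->frames_sent++;
	return 0;
}

typedef void (*toy_l3_cb)(struct toy_datalink *dl, int payload);

/* Receive path: hand the frame to L3 in-line, then try to transmit. */
static int toy_rx_i(struct toy_datalink *dl, int payload, toy_l3_cb cb)
{
	cb(dl, payload);                 /* L3 may release the link here */
	return toy_send_i(dl, payload);  /* guard catches the state change */
}

/* L3 handler that stops the link while the frame is being processed. */
static void toy_l3_stops_link(struct toy_datalink *dl, int payload)
{
	(void)payload;
	toy_release(dl);
}

/* L3 handler that leaves the link alone. */
static void toy_l3_noop(struct toy_datalink *dl, int payload)
{
	(void)dl;
	(void)payload;
}
```

With toy_l3_noop the frame is sent as usual; with toy_l3_stops_link the guarded toy_send_i() returns an error instead of dereferencing the freed history. As noted above, this only covers one code path and does not address other places that may rely on stale state.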

#3 Updated by laforge about 4 years ago

  • Related to Bug #1762: Review LAPD code for race conditions regarding state, particularly in RELEASE added

#4 Updated by laforge almost 4 years ago

  • Assignee deleted (laforge)

#5 Updated by laforge almost 3 years ago

  • Status changed from In Progress to New

#6 Updated by laforge over 1 year ago

  • Assignee set to laforge

#7 Updated by tnt over 1 year ago

  • Related to Bug #3975: osmo-bsc crash during startup with nokia insite added

#8 Updated by laforge 5 months ago

  • Status changed from New to In Progress

I've started to investigate this. Finding a way to solve it has indeed been quite tricky so far.

#9 Updated by laforge 5 months ago

bts_nokia_site

lapd_sap_{start,stop} are called as follows:

  • S_L_INP_LINE_INIT (start OML)
  • S_L_INP_TEI_UNKNOWN (start RSL)
  • reset_timer_cb() (stop all; start OML)
  • when ACK for RESET was received (stop all)
  • ACK for CONF_DATA was received (start RSL)

bts_ericsson_om2000

lapd_sap_{start,stop} are called as follows:

  • S_L_INP_LINE_INIT (start)
  • S_L_INP_LINE_NOALARM (start)
  • S_L_INP_LINE_ALARM (stop)

conclusion so far

  • the critical part is the lapd_sap_stop()
  • it's only critical when used from code paths that will use the SAP afterwards
  • input signals should not do this, they are dispatched from driver code
    • S_L_INP_LINE_INIT is only generated by e1inp_line_update() which is called from vty
    • S_L_INP_LINE_ALARM + S_L_INP_LINE_NOALARM are currently only generated by DAHDI, when read/write returns an error or the fd is in exceptfds during select
    • LAPD_ERR_UNKNOWN_TEI is generated by LAPD code after all processing
So this means it can currently only be triggered in the Nokia code, and it's likely one of the non-signal cases:
  • reset_timer_cb() (stop all; start OML)
    • called from osmocom timer abstraction; ruled out
  • ACK for CONF_DATA was received (start RSL)
    • only starts the LAPD link
  • when ACK for RESET was received (stop all)
    • this looks like the only code path causing it. We receive an OML message (LAPD I-frame), and during processing of that message we stop the datalink, and then return back into LAPD I-frame processing but the data link is gone.

#10 Updated by laforge 5 months ago

Fundamentally, this is one of the drawbacks of our 'call everything in-line/synchronous' architecture. This has already proven to be suboptimal in a variety of situations, such as FSM event dispatch, and also here.

If the L3 entity (the user of the DL-SAP provided by LAPD) had an inbound message queue, we would not process the L3 message in-line, but simply put it in the queue and finish LAPD processing. Once that is finished, we (or some scheduler) would check whether there is new data in the L3 input queue and process it. At that point, deleting the LAPD instance would no longer be a problem.
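A rough sketch of that queue idea in C follows. It is purely illustrative: the toy_* names, the fixed-size ring buffer and the TOY_MSG_RESET_ACK value are assumptions for this example and do not correspond to any existing libosmocore or OsmoBSC API. The receive path only enqueues; the L3 handler runs later from a separate drain step, so stopping the link there never happens while LAPD receive processing is still on the stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH   8
#define TOY_MSG_RESET_ACK 0   /* hypothetical message that makes L3 stop the link */

/* Fixed-size ring buffer standing in for an L3 inbound message queue. */
struct toy_l3_queue {
	int msgs[TOY_QUEUE_DEPTH];
	size_t head;   /* next slot to read */
	size_t count;  /* number of queued messages */
};

struct toy_link {
	bool established;
	bool in_rx;        /* true while the LAPD receive path is running */
	int unsafe_stops;  /* stops requested while in_rx (the bug scenario) */
	struct toy_l3_queue l3q;
};

static bool toy_enqueue(struct toy_l3_queue *q, int msg)
{
	assert(q != NULL);
	if (q->count == TOY_QUEUE_DEPTH)
		return false; /* queue full, caller has to deal with it */
	q->msgs[(q->head + q->count) % TOY_QUEUE_DEPTH] = msg;
	q->count++;
	return true;
}

static bool toy_dequeue(struct toy_l3_queue *q, int *msg)
{
	assert(q != NULL && msg != NULL);
	if (q->count == 0)
		return false;
	*msg = q->msgs[q->head];
	q->head = (q->head + 1) % TOY_QUEUE_DEPTH;
	q->count--;
	return true;
}

/* LAPD receive path: queue the message for L3 and finish own processing. */
static bool toy_lapd_rx(struct toy_link *l, int msg)
{
	bool ok;

	l->in_rx = true;
	ok = toy_enqueue(&l->l3q, msg);
	/* LAPD would continue here, e.g. by sending pending I frames */
	l->in_rx = false;
	return ok;
}

/* Stop the link; counts calls that arrive in the middle of receive processing. */
static void toy_link_stop(struct toy_link *l)
{
	if (l->in_rx)
		l->unsafe_stops++;
	l->established = false;
}

/* Run by the main loop (or some scheduler) after LAPD processing is done. */
static size_t toy_l3_drain(struct toy_link *l)
{
	size_t handled = 0;
	int msg;

	while (toy_dequeue(&l->l3q, &msg)) {
		if (msg == TOY_MSG_RESET_ACK)
			toy_link_stop(l);
		handled++;
	}
	return handled;
}
```

After toy_lapd_rx() returns, the link is still established; only the later toy_l3_drain() call stops it, and unsafe_stops stays at zero. The cost is an extra scheduling step and the need to handle a full queue.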

#11 Updated by laforge 5 months ago

For now the only trivial solution I can see is to consider removing the DL SAP instance during L3 message processing illegal. The Nokia BTS driver must do so asynchronously.
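One way to picture "asynchronously" is a deferred job that the L3 handler only schedules and that runs later from the main loop, for instance from a timer callback (comment #9 rules out timer callbacks as a trigger). The following C sketch assumes a hypothetical single-slot deferral mechanism; the toy_* names are invented for illustration and this is not necessarily how the actual fix is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*toy_deferred_fn)(void *data);

/* Single-slot deferred job; a real program might use a timer or job list. */
struct toy_deferred {
	toy_deferred_fn fn;
	void *data;
	bool pending;
};

/* Remember a function to be called later, outside the current call chain. */
static void toy_defer(struct toy_deferred *d, toy_deferred_fn fn, void *data)
{
	assert(d != NULL && fn != NULL);
	d->fn = fn;
	d->data = data;
	d->pending = true;
}

/* Called from the main loop once no LAPD processing is on the stack. */
static bool toy_run_deferred(struct toy_deferred *d)
{
	assert(d != NULL);
	if (!d->pending)
		return false;
	d->pending = false;
	d->fn(d->data);
	return true;
}

/* Toy stand-in for the per-BTS state kept by a BTS driver. */
struct toy_bts {
	bool sap_running;
	struct toy_deferred stop_job;
};

/* The actual stop; only ever invoked via toy_run_deferred(). */
static void toy_stop_sap(void *data)
{
	struct toy_bts *bts = data;

	bts->sap_running = false;
}

/* L3 handler for the RESET acknowledgement: schedule the stop, do not run it. */
static void toy_handle_reset_ack(struct toy_bts *bts)
{
	toy_defer(&bts->stop_job, toy_stop_sap, bts);
}
```

Here toy_handle_reset_ack() returns with the SAP still running, so the LAPD receive path that delivered the message can complete normally; the stop takes effect on the next toy_run_deferred() call.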

#12 Updated by laforge 5 months ago

  • Project changed from libosmocore to OsmoBSC
  • Category set to Nokia BTS
  • % Done changed from 20 to 90

Proposed fix in https://gerrit.osmocom.org/c/osmo-bsc/+/18009 - still needs manual testing/verification

#13 Updated by laforge 3 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

tnt has reported that the [meanwhile merged] fix solves the problem.

#14 Updated by laforge 3 months ago

  • Related to Bug #4646: SEGV when bringing up Nokia InSite added

#15 Updated by laforge 2 months ago

  • Related to Bug #1982: LAPD: segfault in lapd_est_req function added
