Project

General

Profile

Bug #1761

LAPD: segfault when bootstrapping Nokia InSite

Added by laforge over 5 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Nokia BTS
Target version:
-
Start date:
07/03/2016
Due date:
% Done:

100%

Spec Reference:

Description

When bootstrapping a Nokia InSite BTS, current OsmoNITB segfaults.

The reason for this is as follows:

  • ABM is established.
  • LAPD code hands an I frame to the application using send_dl_l3()
  • user application decides to call lapd_sap_stop() resulting in a local RELEASE request to LAPD
  • LAPD clears the transmit history and changes to IDLE state
  • application returns from processing the I frame
  • code proceeds in lapd_rx_i() and tries to transmit an I frame, as it didn't realize the state has meanwhile changed
  • lapd_send_i() tries to use dl->tx_hist -> boom.

As this is the second bug related to accessing a free'd tx_hist, the code seems to require a more thorough audit.


Related issues

Related to libosmocore - Bug #1760: LAPD: segfault in T200 call-backClosed07/03/2016

Related to libosmocore - Bug #1762: Review LAPD code for race conditions regarding state, particularly in RELEASENew07/03/2016

Related to OsmoBSC - Bug #3975: osmo-bsc crash during startup with nokia insiteClosed05/04/2019

Related to libosmocore - Bug #4646: SEGV when bringing up Nokia InSiteResolved07/04/2020

Related to libosmocore - Bug #1982: LAPD: segfault in lapd_est_req functionResolved03/14/2017

Associated revisions

Revision 456a62e9 (diff)
Added by laforge over 1 year ago

bts_nokia_site: Fix LAPD segfault during reset procedure

The existing Nokia *Site code destroyed the LAPD SAP instance for OML
while processing an OML message. Once the stack frame returned back
to the LAPD code, the LAPD SAP was gone -> segfault.

Let's work around this by moving deletion of the LAPD SAP out-of-line
by starting a timer 0ms in the future. Not particularly nice, but
effective.

Change-Id: I6270c7210f600e53f845561898245d2fd30a368d
Closes: OS#1761

Revision bc1d7152 (diff)
Added by laforge over 1 year ago

lapd_core: After calling into L3, check if the state has changed

While processing an I-frame we may deliver its payload to L3. After
returning from L3 procesing, we run some additional code, assuming
the LAPD/DL state has not changed meanwhile.

However, if the application destroys the LAPD/DL meanwhile, our state
might be NULL again, and in this state we should not perform any further
action.

This is one of the cases where synchronous in-line dispatch across
various layers is hitting us. L3 should have an input queue, and only
start processing after all L2 work has completed and we're about to go
back to sleep in select().

Change-Id: I026b64503511002c13c0f4117648c366c48ecc62
Related: OS#1761
Closes: OS#4646

Revision d2a61179 (diff)
Added by laforge 10 months ago

lapd_core: Don't dereference data link after sending PRIM_DL_REL

We must always send the RELEASE.{indication,confirm} last before
returning from a function. We cannot rely on the datalink to
still be around after the call, as the SAP user might have destroyed
the data link meanwhile.

This fixes a heap use-after-free (at least) with RBS2000 when the BTS
is fully brought up and the OML data link is lost, see OS#1762

Change-Id: I8ccca8d5e5d07b666557afe12ab8ac4910ddfb00
Related: OS#1761
Related: OS#1762

History

#1 Updated by laforge over 5 years ago

  • Related to Bug #1760: LAPD: segfault in T200 call-back added

#2 Updated by laforge over 5 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

The quick fix for this specific bug is to check for LAPD_STATE_MF_EST in the first lines of labd_send_i(), and return if not. Not sure how many other similar bugs are still hidden :/

#3 Updated by laforge over 5 years ago

  • Related to Bug #1762: Review LAPD code for race conditions regarding state, particularly in RELEASE added

#4 Updated by laforge almost 5 years ago

  • Assignee deleted (laforge)

#5 Updated by laforge about 4 years ago

  • Status changed from In Progress to New

#6 Updated by laforge over 2 years ago

  • Assignee set to laforge

#7 Updated by tnt over 2 years ago

  • Related to Bug #3975: osmo-bsc crash during startup with nokia insite added

#8 Updated by laforge over 1 year ago

  • Status changed from New to In Progress

I've started to inviestigate this. Finding a way to solving it is indeed quite tricky so far.

#9 Updated by laforge over 1 year ago

bts_nokia_site

lapd_sap_{start,stop} are called as follows:

  • S_L_INP_LINE_INIT (start OML)
  • S_L_INP_TEI_UNKNOWN (start RSL)
  • reset_timer_cb() (stop all; start OML)
  • when ACK for RESET was received (stop all)
  • ACK for CONF_DATA was received (start RSL)

bts_ericsson_om2000

lapd_sap_{start,stop} are called as follows:

  • S_L_INP_LINE_INIT (start)
  • S_L_INP_LINE_NOALARM (start)
  • S_L_INP_LINE_ALARM (stop)

conclusion so far

  • the critical part is the lapd_sap_stop()
  • it's only critical when used from code paths that will use the SAP afterwards
  • input signals should not do this, they are dispatched from driver code
    • S_L_INP_LINE_INIT is only generated by e1inp_line_update() which is called from vty
    • S_L_INP_LINE_ALARM + S_L_INP_LINE_NOALARM is currently only generated by DAHDI and called when read/write returns an error or the fd is in exceptfds during select
    • LAPD_ERR_UNKNOWN_TEI is generated by LAPD code after all processing
So this means it can currently only be triggered in the Nokia code, and it's likely one of the non-signal cases:
  • reset_timer_cb() (stop all; start OML)
    • called from osmocom timer abstraction; ruled out
  • ACK for CONF_DATA was received (start RSL)
    • only starts the LAPD link
  • when ACK for RESET was received (stop all)
    • this looks like the only code path causing it. We receive an OML message (LAPD I-frame), and during processing of that message we stop the datalink, and then return back into LAPD I-frame processing but the data link is gone.

#10 Updated by laforge over 1 year ago

fundamentally, this is one of the drawbacks of our 'call everything in-line/synchronous' architecture. This has proven to be suboptimal in a variety of sitations already, such as FSM event dispatch, or also here.

If the L3 entity (user of the DL-SAP provided by LAPD) would have an inbound message queue, we would not process the L3 message in-line, but simply put it in the queue and terminate LAPD processing. Once that is finished, we (or some scheduler) would check if there is new data in the L3 input queue, which would process the event. At that point, deleting the LAPD instance was no longer a problem....

#11 Updated by laforge over 1 year ago

For now the only trivial solution I can see is to consider removing the DL SAP instance during L3 message processing illegal. The Nokia BTS driver must do so asynchronously.

#12 Updated by laforge over 1 year ago

  • Project changed from libosmocore to OsmoBSC
  • Category set to Nokia BTS
  • % Done changed from 20 to 90

Proposed fix in https://gerrit.osmocom.org/c/osmo-bsc/+/18009 - still needs manual testing/verification

#13 Updated by laforge over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

tnt has reported that the [meanwhile merged] fix solves the problem.

#14 Updated by laforge over 1 year ago

  • Related to Bug #4646: SEGV when bringing up Nokia InSite added

#15 Updated by laforge over 1 year ago

  • Related to Bug #1982: LAPD: segfault in lapd_est_req function added

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)