Bug #4061
closed
Prolonged remaining in state with RSL link gone, OML link open
100%
Description
Possibly with unreliable Abis link (wifi), osmo-bts seems to get into a state
where the OML is open, but RSL is down. I believe osmo-bts used to detect this and exit, subsequently being restarted by the OS.
"drop bts connection X oml" from the bsc side does not do anything.
Will try to investigate more and add info here each time I find it happening..
root@sysmobts-v2:~# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 172.16.0.11:51368       172.16.0.1:3002         ESTABLISHED
Updated by keith almost 5 years ago
VTY info when osmo-bts is in this state:
OsmoBTS# show version
OsmoBTS 0.8.1.202-e1da (OsmoBTS).

OsmoBTS# show bts 0
BTS 0 is of FIXME type in band GSM850, has CI 0 LAC 0, BSIC 0 and 1 TRX
  Description: (null)
  Unit ID: 1000/0/0, OML Stream ID 0x00
  NM State: Oper 'NULL', Admin 'unknown 0x0', Avail 'Dependency'
  Site Mgr NM State: Oper 'Enabled', Admin 'unknown 0x0', Avail 'OK'
  Paging: Queue size 200, occupied 0, lifetime 0s
  AGCH: Queue limit 0, occupied 0, dropped 0, merged 0, rejected 0, ag-res 0, non-res 0
  CBCH backlog queue length: 0
  Paging: queue length 0, buffer space 200
  OML Link state: disconnected.
  TRX 0 phy 0 0.0 dsp 0.0.0 fpga 0.0.0
  Features:
    001 GPRS
    002 EGPRS
    006 OML Alerts
    007 AGCH/PCH proportional allocation
    009 Fullrate speech V1
    010 Halfrate speech V1
    011 Fullrate speech EFR
    012 Fullrate speech AMR
    013 Halfrate speech AMR
  base transceiver station:
    Received paging requests (Abis): 0 (0/s 0/m 0/h 0/d)
    Dropped paging requests (Abis): 0 (0/s 0/m 0/h 0/d)
    Sent paging requests (Um): 0 (0/s 0/m 0/h 0/d)
    Received RACH requests (Um): 0 (0/s 0/m 0/h 0/d)
    Dropped RACH requests (Um): 0 (0/s 0/m 0/h 0/d)
    Received RACH requests (Handover): 0 (0/s 0/m 0/h 0/d)
    Received RACH requests (CS/Abis): 0 (0/s 0/m 0/h 0/d)
    Received RACH requests (PS/PCU): 0 (0/s 0/m 0/h 0/d)
    Received AGCH requests (Abis): 0 (0/s 0/m 0/h 0/d)
    Sent AGCH requests (Abis): 0 (0/s 0/m 0/h 0/d)
    Sent AGCH DELETE IND (Abis): 0 (0/s 0/m 0/h 0/d)

OsmoBTS# show trx
TRX 0 of BTS 0 is on ARFCN 0
  Description: (null)
  RF Nominal Power: 37000 dBm, reduced by 0 dB, resulting BS power: 37000 dBm
  NM State: Oper 'Disabled', Admin 'Unlocked', Avail 'OK'
  RSL State: disconnected
  Baseband Transceiver NM State: Oper 'NULL', Admin 'unknown 0x0', Avail 'OK'
  IPA stream ID: 0x00
Updated by laforge almost 5 years ago
- Assignee set to Hoernchen
- Priority changed from Normal to High
Hoernchen, please look into this. I guess we need a proper review of the existing code, possibly resulting in the introduction of a proper FSM that takes care of connection failures.
In general, our policy for OsmoBTS has always (since its creation) been to "fail fast", i.e. to terminate the process and let it re-spawn once the OML link is down. I'm not aware of any change of the related code in recent months/years.
For RSL, one can argue that it could/should reconnect while keeping OsmoBTS running, but for OML a restart is the "safe" choice as the RF carrier will be down during reconfiguration anyway, and a restart of the process will [via our systemd service on osmo-bts-{sysmo,lc15,oc2g}] reset all state by reloading the FPGA bitstream and the DSP image, ensuring we always start from a 100% defined, clean state.
Updated by laforge almost 5 years ago
side note: it might also make sense to have a look how a nanoBTS behaves in comparison.
Updated by Hoernchen almost 5 years ago
I can confirm that osmo-bts does not detect a broken TCP connection if I drop the packets using iptables. A TCP connection with default settings will only time out after multiple hours, so this sounds like a reasonable explanation for the issue. osmo-bsc supports TCP keepalive via a config setting, but osmo-bts currently ignores that setting; it is only used by the BSC callback function in libosmo-abis.
I've pushed a patch for libosmo-abis at https://gerrit.osmocom.org/c/libosmo-abis/+/14564 that allows using the usual timeout setting for ipa clients like osmo-bts, which fixes the issue for me.
Example config lines for osmo-bts:
e1_input
 e1_line 0 driver ipa
 e1_line 0 port 0
 e1_line 0 keepalive 10 2 5
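To illustrate what the `keepalive 10 2 5` line above amounts to at the socket level, here is a minimal Python sketch for Linux. It assumes the three numbers mean idle time in seconds, number of probes, and probe interval in seconds, which matches my reading of the libosmo-abis VTY help but should be checked against the merged patch. The helper names (`parse_keepalive`, `apply_keepalive`, `detection_time`) are made up for this example and are not part of any Osmocom code.

```python
import socket


def parse_keepalive(line):
    """Parse 'e1_line <n> keepalive <idle> <count> <interval>' (assumed order)."""
    parts = line.split()
    if len(parts) != 6 or parts[0] != "e1_line" or parts[2] != "keepalive":
        raise ValueError("not an e1_line keepalive line: %r" % line)
    idle, count, interval = (int(p) for p in parts[3:])
    return {"idle": idle, "count": count, "interval": interval}


def apply_keepalive(sock, idle, count, interval):
    """Enable TCP keepalive on a socket using the Linux per-socket options."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)


def detection_time(idle, count, interval):
    """Rough time until an idle, silently dead peer is noticed, in seconds."""
    return idle + count * interval
```

With the example values, a dead peer on an otherwise idle link would be noticed after roughly 10 + 2 * 5 = 20 seconds, instead of the hours that default settings allow.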
Updated by Hoernchen almost 5 years ago
I've added TCP_USER_TIMEOUT to the patch, too. Keepalive only applies to idle connections, whereas this timeout applies to unacked data. The manpage says: "[...] failure may take up to 20 minutes with the current system defaults in a normal WAN environment."
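For reference, a minimal Python sketch of setting TCP_USER_TIMEOUT on a Linux socket. The option takes milliseconds. The 20000 ms value used in the usage note is only an assumption chosen to line up with the keepalive example; the value the patch actually derives is not stated in this ticket. `set_user_timeout` is a hypothetical helper name.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Limit how long transmitted data may stay unacknowledged (Linux only).

    Once the limit is exceeded the kernel closes the connection, so a
    blocked write fails quickly instead of retrying for many minutes.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
```

Calling `set_user_timeout(sock, 20000)` next to the keepalive options would cover both the idle case and the case where data is in flight when the link breaks.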
Updated by laforge almost 5 years ago
Please also note that we don't have to rely on TCP keepalives or anything like that,
as there's the IPA PING/PONG mechanism as part of the IPA CCM sub-layer. Both sides should
actually send PING messages in periodic intervals, and give up if they don't receive a PONG
from the peer. It might be that we only implemented the "PING responder" part.
I recently added ipa_keepalive_fsm_start to libosmo-abis. It's used only in osmo-remsim,
but should probably be used by virtually any of our programs that implement IPA
multiplex, starting from BTS (Abis), BSC (Abis), MSC (SCCPlite, GSUP), HLR (GSUP),
all our CTRL interfaces, ... - but anything beyond RSL+OML in BTS+BSC is out of scope for this
ticket. I'll add separate tickets about this.
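As a rough illustration of the PING/PONG idea described above, here is a small Python sketch. It relies on the IPA framing as I understand it: a 2-byte big-endian payload length, a 1-byte stream identifier (0xFE for the CCM sub-layer), then the payload, with message type 0x00 for PING and 0x01 for PONG. The `PingTracker` class and the function names are invented for this example and do not mirror the `ipa_keepalive_fsm_start` API in libosmo-abis.

```python
import struct

IPAC_PROTO_IPACCESS = 0xFE  # stream ID of the IPA CCM sub-layer
IPAC_MSGT_PING = 0x00
IPAC_MSGT_PONG = 0x01


def ipa_ccm(msg_type):
    """Build a CCM frame: 16-bit length, stream ID, one-byte message type."""
    payload = bytes([msg_type])
    return struct.pack(">HB", len(payload), IPAC_PROTO_IPACCESS) + payload


def parse_ipa(frame):
    """Split one complete IPA frame into (stream_id, payload)."""
    length, proto = struct.unpack(">HB", frame[:3])
    payload = frame[3:3 + length]
    if len(payload) != length:
        raise ValueError("truncated IPA frame")
    return proto, payload


class PingTracker:
    """Hypothetical helper: give up if a PING stays unanswered too long."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending_since = None  # time of the oldest unanswered PING

    def on_ping_sent(self, now):
        if self.pending_since is None:
            self.pending_since = now

    def on_frame(self, frame):
        proto, payload = parse_ipa(frame)
        if proto == IPAC_PROTO_IPACCESS and payload[:1] == bytes([IPAC_MSGT_PONG]):
            self.pending_since = None

    def link_dead(self, now):
        return (self.pending_since is not None
                and now - self.pending_since >= self.timeout)
```

The caller would send `ipa_ccm(IPAC_MSGT_PING)` periodically, feed received frames to `on_frame`, and tear down the link (or exit, following the "fail fast" policy) once `link_dead` returns true. Unlike TCP keepalive, this check works end to end at the application layer.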
Updated by Hoernchen over 4 years ago
- % Done changed from 0 to 100
I'll close this for the time being. The patch that adds the TCP keepalive/timeouts for IPA clients was merged, so unless you happen to have a weird connection that intercepts TCP connections, the existing keepalive implementation will cover the BSC/BTS RSL/OML case.