Project

General

Profile

Bug #3273

osmo-bts-octphy failing to start in prod setup

Added by pespin 2 months ago. Updated 13 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Target version:
-
Start date:
05/16/2018
Due date:
% Done:

50%

Estimated time:
Spec Reference:

Description

Since a few days ago, the osmo-bts-octphy tests fail during startup of the OCTBTS:

(launched: 2018-05-16_13:19:33.210299)
<0006> l1_if.c:770 model_init()
20180516131933306 DLCTRL <0017> control_if.c:866 CTRL at 127.0.0.1 4238
20180516131933306 DLGLOBAL <0010> telnet_interface.c:104 telnet at 127.0.0.1 4241
20180516131933307 DLINP <0012> input/ipaccess.c:891 enabling ipaccess BTS mode, OML connecting to 10.42.42.7:3002
20180516131933307 DABIS <000d> abis.c:229 Input Signal 4 received
20180516131933307 DL1C <0006> phy_link.c:58 PHY link state change shutdown -> connecting
20180516131933307 DL1C <0006> l1_if.c:1740 Opening L1 interface for OctPHY (enp2s0)
20180516131933321 DLINP <0012> input/ipa.c:131 10.42.42.7:3002 connection done
20180516131933321 DLINP <0012> input/ipaccess.c:712 received ID get from 8/0/0
20180516131933322 DABIS <000d> abis.c:107 OML Signalling link up
20180516131933322 DOML <0001> oml.c:313 OC=SITE-MANAGER INST=(ff,ff,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=GPRS-NSE INST=(00,ff,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=GPRS-CELL INST=(00,ff,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=GPRS-NSVC INST=(00,00,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=GPRS-NSVC INST=(00,01,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,00) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,01) Tx STATE CHG REP
20180516131933322 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,02) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,03) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,04) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,05) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,06) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:348 OC=RADIO-CARRIER INST=(00,00,ff) AVAIL STATE Power off -> OK
20180516131933323 DOML <0001> oml.c:355 OC=RADIO-CARRIER INST=(00,00,ff) OPER STATE NULL -> Disabled
20180516131933323 DOML <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:467 OC=RADIO-CARRIER INST=(00,00,ff) Tx SW ACT REP
20180516131933323 DOML <0001> oml.c:348 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) AVAIL STATE Power off -> OK
20180516131933323 DOML <0001> oml.c:313 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:467 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx SW ACT REP
20180516131933323 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,00) AVAIL STATE Power off -> Dependency
20180516131933323 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,00) OPER STATE NULL -> Disabled
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,00) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,01) AVAIL STATE Power off -> Dependency
20180516131933323 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,01) OPER STATE NULL -> Disabled
20180516131933323 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,01) Tx STATE CHG REP
20180516131933323 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,02) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,02) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,02) Tx STATE CHG REP
20180516131933324 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,03) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,03) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,03) Tx STATE CHG REP
20180516131933324 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,04) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,04) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,04) Tx STATE CHG REP
20180516131933324 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,05) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,05) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,05) Tx STATE CHG REP
20180516131933324 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,06) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,06) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,06) Tx STATE CHG REP
20180516131933324 DOML <0001> oml.c:348 OC=CHANNEL INST=(00,00,07) AVAIL STATE Power off -> Dependency
20180516131933324 DOML <0001> oml.c:355 OC=CHANNEL INST=(00,00,07) OPER STATE NULL -> Disabled
20180516131933324 DOML <0001> oml.c:313 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
20180516131933325 DOML <0001> oml.c:548 OC=BTS(01) INST=(ff,ff,ff) Rx GET ATTR
20180516131933325 DOML <0001> oml.c:277 BTS Tx Get Attribute Response
20180516131933380 DOML <0001> oml.c:548 OC=BASEBAND-TRANSCEIVER(04) INST=(00,00,ff) Rx GET ATTR
20180516131933380 DOML <0001> oml.c:277 BASEBAND-TRANSCEIVER Tx Get Attribute Response
20180516131933380 DOML <0001> oml.c:229 O&M Get Attributes [0], Manufacturer Dependent State is unsupported by TRX.
20180516131933382 DOML <0001> oml.c:1001 OC=SITE-MANAGER(00) INST=(ff,ff,ff) Rx OPSTART
20180516131933382 DOML <0001> oml.c:1012 ... automatic ACK, OP state already was Enabled
20180516131933383 DOML <0001> oml.c:580 OC=BTS(01) INST=(00,ff,ff) Rx SET BTS ATTR
20180516131933383 DOML <0001> oml.c:681 Ignoring T200[0] (150 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933383 DOML <0001> oml.c:681 Ignoring T200[1] (180 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933383 DOML <0001> oml.c:681 Ignoring T200[2] (180 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933383 DOML <0001> oml.c:681 Ignoring T200[3] (1680 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933383 DOML <0001> oml.c:681 Ignoring T200[4] (520 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933384 DOML <0001> oml.c:681 Ignoring T200[5] (165 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933384 DOML <0001> oml.c:681 Ignoring T200[6] (1680 ms) as sent by BSC due to suspected LAPDm bug!
20180516131933384 DOML <0001> oml.c:451 Sending FOM ACK.
20180516131933385 DOML <0001> oml.c:1029 OC=BTS(01) INST=(00,ff,ff) Rx CHG ADM STATE
20180516131933385 DOML <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
20180516131933386 DOML <0001> oml.c:1001 OC=BTS(01) INST=(00,ff,ff) Rx OPSTART
20180516131933386 DOML <0001> oml.c:355 OC=BTS INST=(00,ff,ff) OPER STATE NULL -> Enabled
20180516131933386 DOML <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
20180516131933430 DOML <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) Rx OPSTART
20180516131933432 DOML <0001> oml.c:750 OC=RADIO-CARRIER(02) INST=(00,00,ff) Rx SET RADIO CARRIER ATTR
20180516131933432 DOML <0001> oml.c:782 Set RF Max Power Reduction = 0 dBm
20180516131933432 DL1C <0006> tx_power.c:248 power_ramp_start(cur=0, tgt=0)
20180516131933432 DOML <0001> oml.c:451 Sending FOM ACK.
20180516131933433 DOML <0001> oml.c:1029 OC=RADIO-CARRIER(02) INST=(00,00,ff) Rx CHG ADM STATE
20180516131933434 DOML <0001> oml.c:1055 ADM state already was Unlocked
20180516131933434 DOML <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
20180516131933435 DOML <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) Rx OPSTART
20180516131933435 DL1C <0006> l1_oml.c:1228 Tx APP-INFO.req
20180516131933435 DL1C <0006> l1_oml.c:1159 Tx APP-INFO-SYSTEM.req
20180516131933436 DL1C <0006> l1_oml.c:1364 Tx TRX-OPEN.req(trx=0, rf_port=0, arfcn=868, center=868, tsc=7, rx_gain=70, tx_atten=0)
20180516131934384 DLINP <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
20180516131934433 DL1C <0006> tx_power.c:181 ramping TRX board output power to 0 mdBm.
20180516131935385 DLINP <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
20180516131936385 DLINP <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
20180516131937386 DLINP <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
20180516131938321 DL1C <0006> l1_if.c:203 Timeout waiting for L1 primitive TRX-CLOSE-ALL

Attach a run.tar.gz from osmo-gsm-tester showing the issue.

trial-1304-run.tgz trial-1304-run.tgz 12.1 MB pespin, 05/16/2018 02:28 PM

History

#1 Updated by dexter about 1 month ago

  • Status changed from New to Feedback
  • Assignee changed from dexter to pespin
  • % Done changed from 0 to 50

I have now checked back on this problem. The DSP (both!) were crashed, so that they became unresponsive. A powercycle helped to get it working again.

In order to test I temporarily connected the LTHW Octbts to my test laptop. I got the exact same results as shown in the log above. So it is definitely nothing wrong with the tester.

I fear that this problem will come back from time to time. Both DSPs have their reset lines connected to a GPIO of the Phytec board. Those reset lines can be controlled via dsp_rst.sh, which comes pre-installed on the phytec board. The reset procedure would look like:

dsp_rst.sh on p
dsp_rst.sh off p
dsp_rst.sh on s
dsp_rst.sh off s

Where p = Primary DSP, s = Secondary DSP, on = Reset is asserted, off = Reset is de-asserted

My suggestion is to integrate the reset procedure in the testing process. Maybe we only should reset only when necessary. We have discussed earlier if it makes sense to reset the DSPs before each test, but we came to the conclusion that this is not wise as this is not a normal operating condition. The dsp firmware should survive multiple restarts of openbts.

#2 Updated by pespin about 1 month ago

Ok Thanks, let's not add the reset by default before each tests unless we see it's really necessary. Good to have it documented here, so we can come back to it if it happens again. I'll close the ticket once we have next test run passing with octphy.

#3 Updated by dexter 27 days ago

I have now inspected the log files in the jenkins build artefacts (run.2018-06-25_10-03-06). The current problem seems not to be octphy related. osmo-bts-octphy fails to start because it has problems to bind the telnet/ctrl interface. I have looked into all octphy related tests. The error is the same everywhere. I do not know the environment well enough, but maybe there were some changes or there is still another instance running when osmo-bsc-octphy starts.

(launched: 2018-06-25_11:24:21.828663)
<0006> l1_if.c:770 model_init()
20180625112421929 DLGLOBAL <0010> socket.c:360 unable to bind socket:10.42.42.52:4238: Cannot assign requested address
20180625112421930 DLGLOBAL <0010> socket.c:371 no suitable addr found for: 10.42.42.52:4238
20180625112421930 DLGLOBAL <0010> socket.c:360 unable to bind socket:10.42.42.52:4241: Cannot assign requested address
20180625112421930 DLGLOBAL <0010> socket.c:371 no suitable addr found for: 10.42.42.52:4241
20180625112421930 DLGLOBAL <0010> telnet_interface.c:100 Cannot bind telnet at 10.42.42.52 4241
Error initializing telnet

#4 Updated by pespin 25 days ago

  • Assignee changed from pespin to dexter

Previous issues fixed.

Current issue:

[0;m20180627051913862 DOML <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) Rx OPSTART
20180627051913862 DL1C <0006> l1_oml.c:1228 Tx APP-INFO.req
20180627051913862 DL1C <0006> l1_oml.c:1159 Tx APP-INFO-SYSTEM.req
20180627051913862 DL1C <0006> l1_oml.c:1364 Tx TRX-OPEN.req(trx=0, rf_port=0, arfcn=868, center=868, tsc=7, rx_gain=70, tx_atten=0)
20180627051913863 DL1C <0006> l1_oml.c:1180 Rx APP-INFO.resp (name='octsdr_gsm', desc='Software Define Radio', ver='02.11.00-B1927', ver_hdr='02.11.00-B1927')
20180627051913864 DL1C <0006> l1_oml.c:1128 Rx APP-INFO-SYSTEM.resp (platform='Opus2', version='OCTSYS_VERSION=01.02.26.B1;OCTODK_VERSION=01.17.05-B7;OCTADF_VERSION=04.10.01-B3387;')
20180627051913864 DL1C <0006> l1_oml.c:1131 Note: compiled with multi-trx support.
20180627051913864 DL1C <0006> l1_oml.c:1291 TRX-OPEN.resp(trx=0) = cOCTVC1_GSM_RC_TRX_RADIO_CONFIG
20180627051913865 DL1C <0006> l1_oml.c:1296 TRX-OPEN failed: cOCTVC1_GSM_RC_TRX_RADIO_CONFIG

#5 Updated by dexter 25 days ago

  • Assignee changed from dexter to pespin

I have tested the octbts locally again. I got the same error. A powercycle helped to resolve the problem. The octbts is now up and running again.

(I just wonder if there is a license option that limits the runtime...)

#6 Updated by pespin 20 days ago

  • Assignee changed from pespin to dexter

It seems the Firmware crashed again. It basically crashes every time we run a round of tests in one jenkins job. We should investigate how/when the firmware crashes (if it's our fault) and contact OCTOBTS support to get it fixed.

#7 Updated by dexter 13 days ago

Unfortunately I do not have a response from Octasic yet.

I think I will leave test board I have on my desktop on over night to see if the problem is even connected to any outer factors.

Is there a way to tell which test is causing the fault. Does it hang always at the same test or is it always different?

#8 Updated by pespin 13 days ago

The way to check if a specific test is triggering the issue is to restart the firmware (having it in a working state), then running the hourly set of tests in osmo-gsm-tester, then check the logs to see the first test failing, and repeating the entire procedure some times. If it always fails in the same test, it means probably either that one or the previous one is crashing the firmware. Then manually reproducing same test can help discover more information. However, it could be that we are unable to manually trigger the issue due to timing constrains (ie. osmo-gsm-tester triggering events quickly automatically as a response to other events).

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)