Project

General

Profile

Actions

Bug #6342

open

ms-sdr gprs performance

Added by Hoernchen 29 days ago. Updated 17 days ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
01/24/2024
Due date:
% Done:

0%

Resolution:
Spec Reference:

Description

A simple ping test using using one 74b ping packet per second

ping -I modem4 google.de

Is reliable, but in a very unreliable way:
 ✘ pi5@rp5  ~/bernd/osmo-trx  ➦ 669ee04 ±  ping -I modem4 google.de
PING  (142.251.36.163) from 176.16.222.14 modem4: 56(84) bytes of data.
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=1 ttl=117 time=1540 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=2 ttl=117 time=587 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3 ttl=117 time=5486 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=4 ttl=117 time=4530 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=5 ttl=117 time=3584 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=6 ttl=117 time=2620 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=7 ttl=117 time=1656 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=8 ttl=117 time=692 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=9 ttl=117 time=532 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=10 ttl=117 time=654 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=11 ttl=117 time=5593 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=12 ttl=117 time=4621 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=13 ttl=117 time=3675 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=14 ttl=117 time=2711 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=15 ttl=117 time=1747 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=16 ttl=117 time=783 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=17 ttl=117 time=623 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=18 ttl=117 time=745 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=19 ttl=117 time=5703 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=20 ttl=117 time=4732 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=21 ttl=117 time=3791 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=22 ttl=117 time=2827 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=23 ttl=117 time=1863 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=24 ttl=117 time=899 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=25 ttl=117 time=740 ms
^C
---  ping statistics ---
26 packets transmitted, 25 received, 3.84615% packet loss, time 25381ms
rtt min/avg/max/mdev = 532.376/2517.521/5703.264/1774.915 ms, pipe 6

The transfer is chunky and stalls all the time, it always repeats the same block of messages:

snibedi snab 8< 8< 8<
20240124150628408 DL1C INFO l1ctl.c:754 Tx Reset Req (1)
20240124150628408 DL1C INFO l1ctl.c:404 Sync Req
20240124150628410 DL1C INFO l1ctl.c:764 Layer1 Reset indication
20240124150628410 DCS NOTICE app_modem.c:219 S_L1CTL_RESET
20240124150628441 DL1C INFO l1ctl.c:138 snr=0000, arfcn=871 result=0
20240124150628441 DCS NOTICE app_modem.c:243 S_L1CTL_FBSB_RESP
20240124150628510 DRR NOTICE grr.c:583 GPRS-RR(1)[0x55678c25b0]{PACKET_NOT_READY}: Cell is usable, GRR becomes ready
20240124150628551 DRR NOTICE grr.c:683 GPRS-RR(1)[0x55678c25b0]{PACKET_ACCESS}: Rx RACH.conf (RA=0x79, T1=6, T3=14, T2=23, FN=8837)
20240124150628734 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150628734 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150628764 DL1C INFO l1ctl.c:553 Tx Dedic.Mode Est Req (arfcn=979, chan_nr=0xc3)
20240124150628822 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150628822 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150628882 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150628882 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150628942 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150628942 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150629021 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150629021 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150629098 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150629098 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150629159 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150629159 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150629200 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629260 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629338 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629398 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629458 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629518 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150629735 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150629735 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150630119 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150630119 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150630358 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150630735 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150630735 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150631240 DGMM INFO gmm_prim.c:677 Rx from lower layers: GMRR-LLC_TRANSMITTED.indication
20240124150631240 DGMM INFO gmm.c:311 GMME(IMSI-001010000000000:PTMSI-d041daef:TLLI-d041daef) READY timer started (expires in 44 seconds)
20240124150631480 DSNDCP INFO sndcp_prim.c:594 Rx from lower layers: LL-UNITDATA.indication
20240124150631736 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150631736 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150632766 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150632766 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150633790 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150633790 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150634814 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150634814 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150635838 DTUN DEBUG app_modem.c:122 APN(internet): system wants to transmit IPv4 pkt to 142.251.36.163 (84 bytes)
20240124150635838 DSNDCP INFO sndcp_prim.c:426 Rx from upper layers: SN-UNITDATA.request
20240124150636600 DRLCMAC NOTICE tbf_dl.c:103 TBF(DL:NR-5:TLLI-d041daef) Timeout of T3190

snibedi snab 8< 8< 8<
loops here

curl --interface modem4 http://fsn.icmp.hetzner.com/1GB.bin -o /dev/null

Fails completely, probably for the same reason.

pcap contains debug log level messages.

all of this is with multislot-class default 1 pcu config.


Files

brokengprs.pcapng brokengprs.pcapng 6.41 MB Hoernchen, 01/24/2024 02:23 PM
ms-cni-gprsping.tar.xz ms-cni-gprsping.tar.xz 2.77 MB ms lo + ms modem4 + cni lo Hoernchen, 01/24/2024 05:19 PM

Related issues

Related to OsmocomBB - Bug #6200: osmo-trx-ms: lots of @Received bad frame (rc=-1, ber=444/444)@NewHoernchen10/03/2023

Actions
Actions #1

Updated by pespin 29 days ago

  • Status changed from New to Feedback
  • Assignee changed from pespin to Hoernchen

Hoernchen I'd need to have also GSMTAP (CCH and PDCH RLC/MAC) to figure out what's going wrong there (better with the ping test to have easier output to debug).

Actions #2

Updated by Hoernchen 29 days ago

Actions #3

Updated by Hoernchen 24 days ago

  • Related to Bug #6200: osmo-trx-ms: lots of @Received bad frame (rc=-1, ber=444/444)@ added
Actions #4

Updated by pespin 23 days ago

  • Assignee changed from pespin to Hoernchen

My analysis/understanding from ms-cni-gprsping.tar.xz gprs_ms_ping_lo+modem4.pcapg:

By filtering out rlc/mac, one can see ping-pong works well for a while until 27269. Then, ping in 27515 and following ones are never answered.
By adding back rlc/mac and looking around the first ping packet never answered (27515), we see:


// REPLY FROM LAST WORKING PING REQ:
27269    17:15:46.662198808    Jan 24, 2024 18:15:46.662198808 CET    216.58.206.35        176.16.222.16        ICMP    84    ✓    Echo (ping) reply    id=0x6ece, seq=26/6656, ttl=59 (request in 26889)
...
// NOW PCU ACKs the whole UL TBF (FinalACK=true), so the UL TBF at the MS moves to RELEASING state to ack the UL ACK/NACK with a CTRL ACK. THIS MATCHES gprs_cni_ping.pcapng frame_nr=135893.
27363    17:15:46.603026834    Jan 24, 2024 18:15:46.603026834 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    177    ✓    TS=4 FN=2554608 Rx Pkt UL ACK/NACK
27373    17:15:46.603059538    Jan 24, 2024 18:15:46.603059538 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    193    ✓    UL_TBF[0x55804c8f60]{FINISHED}: Final ACK received 
27374    17:15:46.603063297    Jan 24, 2024 18:15:46.603063297 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    197    ✓    UL_TBF[0x55804c8f60]{FINISHED}: state_chg to RELEASING 
27375    17:15:46.603067427    Jan 24, 2024 18:15:46.603067427 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    189    ✓    Register POLL (TS=4 FN=2554621, reason=UL_ACK) 
...
// FINALLY THE FN MARKED BY PREVIOUS RRBP ARRIVES, AND MS RLC/MAC STACK TRANSMITS THE Tx Packet Control ACK AS EXPECTED, THIS MATCHES gprs_cni_ping.pcapng frame_nr=136212.
27415    17:15:46.644324395    Jan 24, 2024 18:15:46.644324395 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    204    ✓    (ts=4,fn=2554621,usf=0) Tx Pkt Control Ack (UL ACK/NACK poll) 
27418    17:15:46.644336877    Jan 24, 2024 18:15:46.644336877 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    178    ✓    GRE(f570fc75) Tx Packet Control Ack
27419    17:15:46.644340081    Jan 24, 2024 18:15:46.644340081 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    186    ✓    Tx to lower layers: L1CTL-PDCH_DATA.request 
27420    17:15:46.644343321    Jan 24, 2024 18:15:46.644343321 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    222    ✓    GPRS-RR(1)[0x55803cf5b0]{PACKET_TRANSFER}: Received Event GRR_EV_PDCH_BLOCK_REQ 
27421    17:15:46.644349358    Jan 24, 2024 18:15:46.644349358 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    186    ✓    UL_TBF_ASS[0x55804cd630]{IDLE}: Deallocated 
27422    17:15:46.644353007    Jan 24, 2024 18:15:46.644353007 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    224    ✓    UL_TBF[0x55804c8f60]{RELEASING}: Send L1CTL-CFG_UL_TBF.req ul_tbf_nr=18 (release) 
27423    17:15:46.644356192    Jan 24, 2024 18:15:46.644356192 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    187    ✓    Tx to lower layers: L1CTL-CFG_UL_TBF.request 
27424    17:15:46.644359266    Jan 24, 2024 18:15:46.644359266 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    227    ✓    GPRS-RR(1)[0x55803cf5b0]{PACKET_TRANSFER}: Received Event GRR_EV_PDCH_UL_TBF_CFG_REQ 
27425    17:15:46.644362692    Jan 24, 2024 18:15:46.644362692 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    187    ✓    UL_TBF[0x55804c8f60]{RELEASING}: Deallocated 

// From this point, the MS only has a DL TBF active. Upper layers want to sent next ping req:
27515    17:15:46.999199627    Jan 24, 2024 18:15:46.999199627 CET    176.16.222.16        216.58.206.35        ICMP    84    ✓    Echo (ping) request  id=0x6ece, seq=27/6912, ttl=64 (no response found!)
//... but it cannot start a UL TBF directly, needs to wait until PCU requests DL ACK/NACK through RRBP so it can ask for it. This happens at:
27526    17:15:46.764364749    Jan 24, 2024 18:15:46.764364749 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    181    ✓    (ts=4,fn=2554647,usf=7) Tx DL ACK/NACK 
// The above matches gprs_cni_ping.pcapng frame_nr=136561, where the PKT DL ACK/NACK received by the PCU contains: "Exist_Channel_Request_Description: 0". That's because the PING from 27515 was not yet enqueued in RLC/MAC?
// Indeed, because it receives the ping from the tun device later at (84 bytes matches the ping which makes sense):
27645    17:15:46.999229719    Jan 24, 2024 18:15:46.999229719 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    219    ✓    APN(internet): system wants to transmit IPv4 pkt to 216.58.206.35 (84 bytes) 
27651    17:15:46.999266849    Jan 24, 2024 18:15:46.999266849 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    207    ✓    GRE(f570fc75) Enqueueing LLC-PDU len=94 SAPI=SNDCP3 radio_prio=4 

// THIS MEANS WE HAVE TO WAIT FOR NEXT DL DATA WITH RRBP SO UE CAN SEND Exist_Channel_Request_Description=1 in DL ACK/NACK IN ORDER TO REQUEST A UL TBF. THAT WOULD BE gprs_cni_ping.pcapng frame_nr=136972, BUT MS RLC/MAC LAYERS DON'T RECEIVE IT PROPERLY!:
27642    17:15:46.981862584    Jan 24, 2024 18:15:46.981862584 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    179    ✓    Rx RTS.ind (fn=2554695, ts=4, usf=7) 
27653    17:15:47.004417364    Jan 24, 2024 18:15:47.004417364 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    191    ✓    Rx from lower layers: L1CTL-PDCH_DATA.indication 
27654    17:15:47.004420290    Jan 24, 2024 18:15:47.004420290 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    179    ✓    Dropping DL data block with length 0 // LOWER LAYER ERROR DECODING!!!!!!!!!!!

// AT THIS POINT, DL TBF STALLS SENDING LAST BSN=28 BECAUSE IT DOESN'T RECEIVE DL ACK/NACK DESPITE SENDING SEVERAL RRBPs, UNTIL IT SOMEHOW DECIDES THE UE IS GONE AT AROUND gprs_cni_ping.pcapng frame_nr=139899

TL;DR: Starting from here, lower layers start decoding and submitting 0-byte DL blocks to the upper rlc/mac layer. You can see it with this message appearing all the time, while comparing with the PCU trace you can see that it is actually sending stuff...:

27654    17:15:47.004420290    Jan 24, 2024 18:15:47.004420290 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    179    ✓    Dropping DL data block with length 0

Actions #5

Updated by Hoernchen 23 days ago

So fixeria - why would we reliably end up getting only 0 data after some time? So not just intermittant decoding issues, but realiably, repeatedly, with the same pattern after some time? Can this be related to the audio issues/dropping of bit error packets? What changes that would explain why it works at first, and then breaks?

Actions #6

Updated by Hoernchen 23 days ago

pespin the capture contains multiple repeating blocks/pings - do they all end like this? Do they fail at different points/stages?

Actions #7

Updated by pespin 23 days ago

Hoernchen that keeps happening until the MS decides the DL TBF is no longer available. And while that happens, it basically cannot transmit UL data because it doesn't receive the DL block with RRBP which would allow it to request a new UL TBF to the PCU.
Then, finally libosmo-gprs-rlcmac declares its DL TBF as non-valid and as soon as it goes back to IDLE mode, it will attempt packet access procedure. This can be seen in gprs_cni_ping.pcapng:

// FIRST DL BLOCK WHICH IS NOT RECEIVED AT THE MS (receives len=0):
136972    17:15:46.940324846    Jan 24, 2024 18:15:46.940324846 CET    127.0.0.1    50957    127.0.0.1    4729    GSM RLC/MAC    112    ✓    GPRS DL DATA: CS4 TFI=0 BSN=28 USF=7 [RRBP] Len=49
...
// AFTER ~5 seconds, the MS finally declares the DL TBF as gone and starts packet access again:
150418    17:15:51.985456521    Jan 24, 2024 18:15:51.985456521 CET    127.0.0.1    50957    127.0.0.1    4729    GSMTAP    59    ✓    (RACH) Channel Request 
150855    17:15:52.132117857    Jan 24, 2024 18:15:52.132117857 CET    127.0.0.1    50957    127.0.0.1    4729    GSMTAP    81    ✓    (CCCH) (RR) Immediate Assignment 

If we look at the other pcap (MS side), the decode len=0 situation seems to be temporary. During that time (~1 second), no DL block is received:

// FIRST ONE:
27654    17:15:47.004420290    Jan 24, 2024 18:15:47.004420290 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    179    ✓    Dropping DL data block with length 0
...
//LAST ONE:
28118    17:15:48.102615917    Jan 24, 2024 18:15:48.102615917 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    179    ✓    Dropping DL data block with length 0 
//AFTER THAT, IT CONTINUES RECEIVING DL STUFF PROPERLY WITH NO APPARENT CHANGE DONE BY RLC/MAC LAYER:
28128    17:15:48.121403866    Jan 24, 2024 18:15:48.121403866 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    185    ✓    TS=4 FN=2554937 Rx Pkt DL Dummy Ctrl Block 

But that second is enough for the PCU to conclude the MS is not there, because it asked N... (counter) times and didn't receive reponse. (If there were more MS, it would take more time to ask one MS for N retransmissions, so it would take longer to consider it gone).

The MS considers the DL TBF gone at this point in time (which matches what we see with RACH req in the other pcap):

30064    17:15:51.809884307    Jan 24, 2024 18:15:51.809884307 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    164    ✓    Tx L1CTL-PDCH_REL.req 
30067    17:15:51.809903344    Jan 24, 2024 18:15:51.809903344 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    159    ✓    Tx Reset Req (1) 
30071    17:15:51.810945269    Jan 24, 2024 18:15:51.810945269 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    156    ✓    S_L1CTL_RESET 
...
30085    17:15:51.940372242    Jan 24, 2024 18:15:51.940372242 CET    127.0.0.1    47476    127.0.0.1    4729    GSMTAP    220    ✓    UL_TBF_ASS[0x55804c8f60]{IDLE}: Requesting one-phase packet access using CCCH 

After that, everything seems to be fine again (it actually was fine already after the ~1 second, but unfortunately too late already for the PCU, so for the MS to find out that the DL TBF is gone it takes a while...)

This analysis matches with the ping output you reported, where everything works fine in general, but then you see a huge delay and losses for a while, then it works again.

So, to me, right now, this looks like lower layers at some point being buggy and reading/decoding crap during some amount of time (buffer overflow/underrun, whatever).

Actions #8

Updated by pespin 23 days ago

This is the caller diagram for the code path where DL block with len=0 is received, starting from the moment it is received at the unix domain socket:

layer2_read [unix fd, fill msg->l1h, from here on we have modem app]
    l1ctl_recv(hdr->msg_type=L1CTL_GPRS_DL_BLOCK_IND)
        msgb_pull(msg, "sizeof(struct l1ctl_hdr));
        msg->l1h = hdr->data;
        rx_l1_gprs_dl_block_ind
            msg->l2h = (void *)&ind->data[0];
            l1ctl_dl_block_cb
                grr_st_packet_transfer_action(ev=GRR_EV_PDCH_BLOCK_IND)
                    handle_pdch_block_ind
                        prim->l1ctl = (struct osmo_gprs_rlcmac_l1ctl_prim) {
                            .pdch_data_ind = {
                                .fn = fn,
                                .ts_nr = ind->hdr.tn,
                                .rx_lev = ind->meas.rx_lev,
                                .ber10k = osmo_load16be(&ind->meas.ber10k),
                                .ci_cb = osmo_load16be(&ind->meas.ci_cb),
                                .data_len = msgb_l2len(msg),
                                .data = msgb_l2(msg),
                            }
                        };
                        osmo_gprs_rlcmac_prim_lower_up (libosmo-gprs-rlcmac from here on)
                            gprs_rlcmac_prim_l1ctl_lower_up
                                rlcmac_prim_handle_l1ctl_pdch_data_ind
                                        if (OSMO_UNLIKELY(rlcmac_prim->l1ctl.pdch_data_ind.data_len == 0)) {
                                            LOGRLCMAC(LOGL_DEBUG, "Dropping DL data block with length 0\n");

So starting from the L1CTL socket and modem app + libosmo-gprs, it's not doing much. I think the problem is most probably on the other side of L1CTL, whoever that is (osmo-trx-ms? trxcon?). I'd go check there add some print for this case where the L1CTL_GPRS_DL_BLOCK_IND message is sent containing a "struct l1ctl_gprs_dl_block_ind" with len(ind.data)=0.

Actions #9

Updated by pespin 23 days ago

On the other side of the L1CTL, continue unwinding the call trace:

rx_pdtch_fn
    l1sched_lchan_emit_data_ind(const uint8_t *data, size_t data_len)
        msg = l1sched_prim_alloc(L1SCHED_PRIM_T_DATA, PRIM_OP_INDICATION); // sets h2 to ->data before pushing prim header
        if (data_len > 0) memcpy(msgb_put(msg, data_len), data, data_len);
        l1sched_prim_to_user(OSMO_PRIM(L1SCHED_PRIM_T_DATA, PRIM_OP_INDICATION))
            handle_prim_data_ind(msg)
                trxcon_st_packet_data_action(ev=TRXCON_EV_RX_DATA_IND, data=struct trxcon_param_rx_data_ind)
                    l1gprs_handle_dl_block_ind
                        struct trxcon_param_rx_data_ind ind = {
                                .data_len = msgb_l2len(msg),
                                .data = msgb_l2(msg),
                                ...
                    }

Starting from l1sched_lchan_emit_data_ind(), it's mostly forwarding the buffer + len untouched/copied, so I'd start by tapping into that function and validating in the test run that it's already len=0 in there. Then, start debugging starting from rx_pdtch_fn() and going further down the stack.

Actions #10

Updated by fixeria 23 days ago

Hoernchen wrote in #note-5:

So fixeria - why would we reliably end up getting only 0 data after some time? So not just intermittant decoding issues, but realiably, repeatedly, with the same pattern after some time?

How do I answer this question without seeing trxcon's (osmo-trx-ms) logging? I have no idea, sorry. There can be several reasons:

  • burst decoding errors (see #6200)
  • no TBFs configured at the l1gprs layer
    • I think we saw exactly this problem in #6338

Can this be related to the audio issues/dropping of bit error packets? What changes that would explain why it works at first, and then breaks?

I don't think so. The modem has nothing to do with audio handling, it's implemented for the Mobile only.

Actions #11

Updated by laforge 23 days ago

On Tue, Jan 30, 2024 at 04:19:54PM +0000, fixeria wrote:

How do I answer this question without seeing trxcon's (osmo-trx-ms) logging? I have no idea, sorry. There can be several reasons:

can you not run the code on the test setup and see the errors?

What I'm seeing here is:

  • ewild is observing some kind of problem in his local setup
  • vyanitskiy and/or pespin just providing some off-line anlysis but not making any statement like "I cannot
    reproduce this bug in the sysmocom office hardware test setup" or a statement they can reproduce the bug.

If ewild sees some problem but pespin / vyanitskiy cannot reproduce it in "their" setup (the remotely
accessibele one at the sysmocom office), we need to investigate what the difference is between those and why
only ewild can reproduce the problem while it works for the others.

Actions #12

Updated by pespin 23 days ago

  • no TBFs configured at the l1gprs layer
    • I think we saw exactly this problem in #6338

This is not the case here, the issues start appearing without any sort of TBF change nearby, it's simply that the received blocks are len=0 out of the blue.

  • burst decoding errors (see #6200)

That's probably going to be the issue there, unfortunately I don't think I'm going to be of much use there.

Let me know if you need any kind of support from me, or help in a debugging session, etc.

vyanitskiy and/or pespin just providing some off-line anlysis but not making any statement like "I cannot
reproduce this bug in the sysmocom office hardware test setup" or a statement they can reproduce the bug.

Probably it can be reproduced there. Last time I checked months ago I basically stopped at some point because I was burning hours like hell because lower layers were not reliable yet, so I couldn't get to reproducible scenarios in upper layers.

So I'm at first reluctant to continue burning hours on my own unless somebody points at an RLC/MAC & upper problem which needs fixing and I have a more or less reliable lower layer to test it.

Actions #13

Updated by Hoernchen 23 days ago

Looks much better setting pcu to "cs max 3" - cs4 is apparently completely broken?

64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=1 ttl=117 time=1261 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=2 ttl=117 time=796 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3 ttl=117 time=704 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=4 ttl=117 time=721 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=5 ttl=117 time=723 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=6 ttl=117 time=302 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=7 ttl=117 time=660 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=8 ttl=117 time=636 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=9 ttl=117 time=634 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=10 ttl=117 time=631 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=11 ttl=117 time=633 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=12 ttl=117 time=630 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=13 ttl=117 time=629 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=14 ttl=117 time=630 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=15 ttl=117 time=628 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=16 ttl=117 time=625 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=17 ttl=117 time=628 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=18 ttl=117 time=625 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=19 ttl=117 time=624 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=20 ttl=117 time=625 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=21 ttl=117 time=622 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=22 ttl=117 time=620 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=23 ttl=117 time=622 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=24 ttl=117 time=620 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=25 ttl=117 time=618 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=26 ttl=117 time=621 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=27 ttl=117 time=618 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=28 ttl=117 time=616 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=29 ttl=117 time=618 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=30 ttl=117 time=616 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=31 ttl=117 time=613 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=32 ttl=117 time=4755 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=33 ttl=117 time=4314 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=34 ttl=117 time=3849 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=35 ttl=117 time=3420 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=36 ttl=117 time=2968 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=37 ttl=117 time=2493 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=38 ttl=117 time=2041 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=39 ttl=117 time=1612 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=40 ttl=117 time=1137 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=41 ttl=117 time=685 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=42 ttl=117 time=569 ms

Actions #14

Updated by pespin 23 days ago

While looking at this some more I stumbled at the fact that the period where the packets are wrongly decoded at the MS side, (GSM FNs 2554695-2554933), matches the time where PCU starts sending DL packets in CS4 coding scheme.
As soon as it goes back to send CS1 (because it considers the TBF to be death, and hence sends CTRL Dummy Blocks, hence CS1), the MS starts receiving the DL blocks.

So it seems the problem is related to CS4, or with some general issue which gets worse with CS4.

Those time frames correspond to the following frame nrs in the pcaps:
gprs_ms_ping_lo+modem4.pcapg: 136972-139899
gprs_cni_ping.pcapng: 27654-28118

Actions #15

Updated by pespin 23 days ago

Other CS4 DL blocks in other places in the pcap also fail (show the "Dropping DL data block with length 0" log message the the MS side). The failures match the whole span of the CS4 transmissions.
gprs_cni_ping.pcapng frame_nr=71993-74954 (GSM FN 2549451)
gprs_cni_ping.pcapng frame_nr=93309-96243
gprs_cni_ping.pcapng frame_nr=115375-118312
etc.

So it's definetly something to do with CS4.

Actions #16

Updated by Hoernchen 23 days ago

still going strong:

64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3249 ttl=117 time=514 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3250 ttl=117 time=493 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3251 ttl=117 time=473 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3252 ttl=117 time=454 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3253 ttl=117 time=535 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3254 ttl=117 time=524 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3255 ttl=117 time=522 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3256 ttl=117 time=523 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3257 ttl=117 time=520 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3258 ttl=117 time=518 ms
64 bytes from muc12s11-in-f3.1e100.net (142.251.36.163): icmp_seq=3259 ttl=117 time=520 ms

Actions #17

Updated by Hoernchen 23 days ago

Actual downloads are still a bit bad, and have a somewhat similar "stalling" issue, during which nothing happens fore quite some time until the transfer continues:

cni  curl --interface modem4 http://fsn.icmp.hetzner.com/1GB.bin -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 1024M    0 86142    0     0   1122      0  11d 01h  0:01:16  11d 01h  1575

Actions #18

Updated by pespin 23 days ago

  • Assignee changed from Hoernchen to fixeria

Assigning to fixeria who's now looking at the CS4 decoding issues.
We also need to gather a pcap during big file download to figure out why they stall sometimes for small periods, or to see if we can find anything wrong.

Actions #19

Updated by fixeria 23 days ago

  • Status changed from Feedback to In Progress

laforge wrote in #note-11:

On Tue, Jan 30, 2024 at 04:19:54PM +0000, fixeria wrote:

How do I answer this question without seeing trxcon's (osmo-trx-ms) logging? I have no idea, sorry. There can be several reasons:

can you not run the code on the test setup and see the errors?

Of course I can, and this is what I am doing now. I managed to reproduce the problem, and actually it's the same problem I reported in #6200 4 months ago.
Except that I was mostly concerned about decoding errors, while Eric (in this ticket) is concerned about the overall performance.
Limiting 'cs max' to 3 in osmo-pcu.cfg also "fixes" #6200 (no more errors, ping delays reduced significantly).

What I'm seeing here is:

  • ewild is observing some kind of problem in his local setup
  • vyanitskiy and/or pespin just providing some off-line anlysis but not making any statement like "I cannot
    reproduce this bug in the sysmocom office hardware test setup" or a statement they can reproduce the bug.

I honestly see nothing wrong in asking the ticket author to provide more information, since most of the time we do off-line analysis looking at logging and PCAPs anyway (looking at the code in parallel). This way me and pespin would not need to spend additional time obtaining the necessary info ourselves. Also, it wasn't obvious to me that Eric is testing on his local setup, I assumed we're all using the sysmocom's mssdr one.

Anyway, I am currently testing CS4 locally with B210 and a normal phone, to make sure it's not a regression on the network side.

Actions #20

Updated by fixeria 23 days ago

fixeria wrote in #note-19:

Anyway, I am currently testing CS4 locally with B210 and a normal phone, to make sure it's not a regression on the network side.

CS4 works fine with a normal phone, so the problem is most likely on the MS side and not on the network side:

OsmoPCU# show rate-counters
        gprs:downlink_cs4:      11793 (0/s 0/m 0/h 0/d) CS4 downlink
          gprs:uplink_cs4:       2092 (0/s 0/m 0/h 0/d) CS4 Uplink
Actions #21

Updated by fixeria 23 days ago

On the MS side, though, even if limit osmo-pcu to use CS1..3 (excluding CS4), I am observing BER > 0 on PDTCH/DL:

20240130212135759 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135777 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135796 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135819 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135837 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135856 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135879 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212135897 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135916 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135920 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PTCCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212135939 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135957 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135976 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212135999 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136017 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136036 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136058 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136077 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136096 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136118 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136137 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136156 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136178 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136197 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136216 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=23, ber=4/456 (sched_lchan_pdtch.c:92)
20240130212136238 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136257 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136276 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136298 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)
20240130212136317 DSCHD NOTICE trxcon(0)[0x557f358d20]: TS3-PDTCH gsm0503_pdtch_decode(): rc=40, ber=224/676 (sched_lchan_pdtch.c:92)

This output can be seen with some additional debug LOGPs in trxcon.

Not sure if those are the real bit errors, or maybe the BER calculation logic is not implemented correctly in libosmocoding?

  • gsm0503_pdtch_decode(): rc=23, ber=4/456 -- this is CS-1, but why are there 4 bit errors?
  • gsm0503_pdtch_decode(): rc=40, ber=224/676 -- this is CS-3, but looks suspicious:
    • 676 is how many bits the convolutional encoder produces before puncturing;
    • 224 is close to how many bits (220) are punctured.

I will do some more testing locally to see which BER values we have when running a virtual setup (trxcon + fake_trx).

Actions #22

Updated by fixeria 23 days ago

fixeria wrote in #note-21:

I will do some more testing locally to see which BER values we have when running a virtual setup (trxcon + fake_trx).

The BER values look identical in a virtual setup, so this is indeed related to puncturing.
We can also see unexpected BER values in the unit test output for libosmocoding:

https://cgit.osmocom.org/libosmocore/tree/tests/coding/coding_test.ok

# PDTCH, CS-1, len=23, no puncturing (same as used for CCCH/BCCH/SDCCH/SACCH)
Encoding: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8  
 Decoded: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8  
pdtch_decode: n_errors=0 n_bits_total=456 ber=0.00

# PDTCH, CS-2, len=34, puncturing removes 132 out of 588 bits, BER ~22%
Encoding: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 2f
 Decoded: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 2f
pdtch_decode: n_errors=132 n_bits_total=588 ber=0.22

# PDTCH, CS-3, len=40, puncturing removes 220 out of 676 bits, BER ~33%
Encoding: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 af 5f c6 36 43 44 03
 Decoded: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 af 5f c6 36 43 44 03
pdtch_decode: n_errors=220 n_bits_total=676 ber=0.33

# PDTCH, CS-4, len=56, no puncturing, BER unknown (no convolutional encoding)
Encoding: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 28
 Decoded: a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 59 a8 42 a3 af 5f c6 36 43 44 ab a3 af 5f c6 36 43 44 ab d9 6d 7d 62 24 c9 d2 92 fa 27 5d 71 7a 28
pdtch_decode: n_errors=0 n_bits_total=444 ber=0.00

So we should be interpreting the reported BER values with a grain of salt. We need to fix libosmocoding, I will report a separate ticket for this.

Actions #23

Updated by fixeria 23 days ago

fixeria wrote in #note-21:

I will do some more testing locally to see which BER values we have when running a virtual setup (trxcon + fake_trx).

I was able to attach to the network with CS-4 being forced in osmo-pcu.cfg in a virtual setup.
This means that the problem is most likely somewhere in the PHY (below trxcon).

20240131055939947 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055939965 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055939984 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940007 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940025 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940044 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940067 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940086 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940104 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940108 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PTCCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940127 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940145 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940164 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940187 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940206 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940224 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940247 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940266 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940290 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940313 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940332 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940349 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940373 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=54, ber=0/444 (sched_lchan_pdtch.c:90)
20240131055940391 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940410 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940433 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)
20240131055940451 DSCHD NOTICE trxcon(0)[0x6305ae44faa0]: TS7-PDTCH gsm0503_pdtch_decode(): rc=23, ber=0/456 (sched_lchan_pdtch.c:90)

In this case we see CS-1 (rc=23) and CS-4 (rc=54) encoded blocks being decoded without any bit errors, as expected (no puncturing in both cases).

Actions #24

Updated by fixeria 23 days ago

  • Status changed from In Progress to Feedback
  • Assignee changed from fixeria to Hoernchen

CS-4 is the least protected coding scheme:

  • the encoder takes 431 bits at the input:
    • 8 (MAC header) + 416 (RLC data) + 7 (spare bits);
  • the USF field (3 bits) is pre-coded to additional 12 bits (redundancy);
  • 16 parity bits are added (CRC16);
  • convolutional coding is not employed;

This also means that the receiving side does not know the exact amount of bit errors, unlike with CS-1, CS-2, and CS-3:

  • if the CRC16 check passes, libosmocore's decoder reports hard-coded BER=0/444 (456 - 12 pre-coded USF bits);
  • if the CRC16 check fails, libosmocore's decoder reports hard-coded BER=444/444, and returns -1.

So my best guess is that for some reason CS-4 encoded blocks are not passing the CRC16 check at the MS side.

Hoernchen this is as far as I can go with my analysis. The fact that we're always seeing at least 4 bit errors on Downlink (see #note-21) could indicate a PHY problem.
We definitely should not be seeing bit errors in lab testing conditions, and even a single bit error would not pass the CRC16 check for CS-4 coded MAC blocks.

I can easily reproduce those constant 4 bit errors by patching trxcon to print BER for CCCH/BCCH and running ccch_scan -a 979:

20240130233110737 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233110755 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233110926 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-BCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233110945 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233110972 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233110991 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111161 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-BCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111180 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111207 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111226 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111397 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-BCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111416 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111443 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111462 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111632 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-BCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111651 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111678 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111697 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-CCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
20240130233111868 DSCHD NOTICE trxcon(0)[0x556f637d20]: TS0-BCCH gsm0503_xcch_decode(): rc=0, ber=4/456 (sched_lchan_xcch.c:93)
Actions #25

Updated by fixeria 22 days ago

fixeria wrote in #note-22:

So we should be interpreting the reported BER values with a grain of salt. We need to fix libosmocoding, I will report a separate ticket for this.

FYI, I've submitted patches fixing BER reporting for PDTCH CS2 and CS3:

https://gerrit.osmocom.org/c/libosmocore/+/35751 utils/conv_codes_gsm.py: fix inconsistent formatting [NEW]
https://gerrit.osmocom.org/c/libosmocore/+/35752 coding: fix a typo in docs for gsm0503_pdtch_decode() [NEW]
https://gerrit.osmocom.org/c/libosmocore/+/35753 coding: improve readability in osmo_conv_decode_ber_punctured() [NEW]
https://gerrit.osmocom.org/c/libosmocore/+/35754 coding: fix artificial bit errors for PDTCH CS2 and CS3 [NEW]
https://gerrit.osmocom.org/c/libosmocore/+/35755 coding: fix wrong n_bits_total reported for PDTCH CS2 and CS3 [NEW]

Actions #26

Updated by pespin 17 days ago

After fixing bug with UL CS4 last RLC block not containing multiple LLCs (already merged in libosmo-gprs), I did some iperf3 tests.

The table at https://en.wikipedia.org/wiki/General_Packet_Radio_Service#Channel_encoding provides some hints on the expected throughput.

First, I also had to fix the UL CS selection thresholds at osmo-pcu.cfg to adapt it to "lower-than-usually-expected" values provided by osmo-trx. This is a good start, since the UE phy setup the delivered C/I is usually 6-7 dB.

cs link-quality-ranges cs1 3 cs2 2 5 cs3 4 7 cs4 5

Then, I changed the initial CS to be the maximum possible to test best conditions. I still kep DL CS3 because DL CS4 is known to be broken (see this ticket and #6200):

 cs 3 4
 ! CS5 DL decoding is broken at osmocom "modem" app right now:
 !cs max 4
 cs max 3 4

With that setup in place, using iperf3 I get in UL 18.8 kbps for UDP and 16.0 kbps for TCP, which goes more or less inline with the 20kbps/slot in CS4 from table above.

For downlink, TCP DL iperf test running for ~30min gave an average 11kbps for CS3, which more or less matches the 14.4kbps max on top of RLC/MAC according to the table above.

One should think that the figures from the table don't take into account IP+TCP overhead, etc.

So, to me what's left is mostly fixing CS4 decoding in DL.

Actions #27

Updated by fixeria 17 days ago

All patches for trxcon/{l1sched,l1gprs} and libosmo-gprs.git have been merged. I deployed them on the mssdr-ms host, and submitted this patch (also deployed):

https://gerrit.osmocom.org/c/osmo-trx/+/35859 osmo-trx-ms: bump osmocom-bb submodule commit

Unfortunately, I can no longer test anything because osmo-trx-ms hangs:

ubuntu@mssdr-ms:~/cni/osmo-trx$ ./Transceiver52M/osmo-trx-ms-blade 
scheduling for: pi4
[DEBUG @ host/libraries/libbladeRF/src/backend/usb/libusb.c:191] Bus 002 Device 002: Nuand bladeRF 2.0, serial 253d3d45e7bc4642b65852e5f5dfb584
[DEBUG @ host/libraries/libbladeRF/src/driver/ina219.c:70] Configuration register: 0x019f
[DEBUG @ host/libraries/libbladeRF/src/driver/ina219.c:82] Calibration register: 0xa000
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/common.c:281] Default tuning mode: Host
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2329] bladerf2_set_tuning_mode: New tuning mode: Host
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2365] bladerf2_set_tuning_mode: Releasing FPGA RFIC control
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/rfic_fpga.c:251] _rfic_fpga_deinitialize: deinitializing
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2380] bladerf2_set_tuning_mode: Initializing Host RFIC control
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/rfic_host.c:85] _rfic_host_initialize: initializating
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/rfic_host.c:154] _rfic_host_initialize: initialization complete
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:207] _bladerf2_initialize: complete
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2329] bladerf2_set_tuning_mode: New tuning mode: FPGA
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2365] bladerf2_set_tuning_mode: Releasing Host RFIC control
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/rfic_host.c:164] _rfic_host_deinitialize: deinitializing
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2380] bladerf2_set_tuning_mode: Initializing FPGA RFIC control
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/rfic_fpga.c:233] _rfic_fpga_initialize: initializing
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2611] trim DAC is enabled
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:2644] saving current trim DAC value: 0x1ee7
[DEBUG @ host/libraries/libbladeRF/src/board/bladerf2/bladerf2.c:1083] bladerf2_set_sample_rate: enabling 4x decimation/interpolation filters
20240205165337023 DL1C NOTICE Init L1CTL server (sock_path=/tmp/osmocom_l2) (l1ctl_server.c:231)
started U control!
started U tx!
20240205165341846 DL1C NOTICE L1CTL server got a new connection (id=0) (l1ctl_server.c:168)
20240205165341847 DL1C NOTICE trxcon(0)[0x5579e64ea0]{RESET}: Received reset request (FULL) (l1ctl.c:439)
20240205165341847 DSCH NOTICE trxcon(0)[0x5579e64ea0]: Reset scheduler and clock counter (sched_trx.c:190)
20240205165341848 DL1C NOTICE trxcon(0)[0x5579e64ea0]{RESET}: Received FBSB request (GSM900 979, timeout 100 TDMA FNs) (l1ctl.c:380)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: Add a new TDMA timeslot #0 (sched_trx.c:207)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: (Re)configure TDMA timeslot #0 as BCCH+CCCH (sched_trx.c:276)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: TS0-SCH activating (sched_trx.c:476)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: TS0-BCCH activating (sched_trx.c:476)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: TS0-RACH activating (sched_trx.c:476)
20240205165341848 DSCH NOTICE trxcon(0)[0x5579e64ea0]: TS0-CCCH activating (sched_trx.c:476)
 #RXG  gain 30 -> 34 sample avg:2.10159 target: >=255
 #RXG  gain 34 -> 38 sample avg:3.21183 target: >=255
 #RXG  gain 38 -> 42 sample avg:5.11048 target: >=255
 #RXG  gain 42 -> 46 sample avg:7.74937 target: >=255
 #RXG  gain 46 -> 50 sample avg:11.8049 target: >=255
 #RXG  gain 50 -> 54 sample avg:17.4051 target: >=255
 #RXG  gain 54 -> 58 sample avg:26.9121 target: >=255
 #RXG  gain 58 -> 62 sample avg:42.8535 target: >=255
[DEBUG @ host/common/src/range.c:67] clamp_to_range: Requested value 62 is above range [-16,60], clamping to 60
^Cexit U tx!
exit U control!
^C^C^C^C^CTerminated

Ctrl + C does not help, you actually need to kill it. Leaving this for Hoernchen.
At this point I am also stopping to work on the project, ping me if anything is needed.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)