Bug #6102
closedFeature #5500: MS-Side GPRS RLC/MAC implementation
libosmo-gprs-rlcmac: DL ACK/NACK endless loop again PCU after 128 bsns received
100%
Description
While testing the "modem" app against osmo-pcu, ping-pong works fine until PCU submits an entire round of 128 BSNs (SNS=128). At that point in time, when PCU starts reusing BSN=0 to sends new data, the libosmo-gprs-rlcmac implementation becomes stuck always answering the DL_ACK/NACK containing STARTING_SEQUENCE_NUMBER=0 and an RRB full of zeroes.
As a result, the PCU keeps resending the same blocks endlessly. I attach a pcap showing the problem. By filtering for DL_ACK/NACK message, once can see the full loop BSN=0..BSN=127, and how the problem starts when PCU starts reusing BSN=0 again.
Ack_Nack_Description ...0 .... = FINAL_ACK_INDICATION: False .... 0000 000. .... = STARTING_SEQUENCE_NUMBER: 0 ...0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000. .... = Received Block Bitmap: 0 ...0 .... = Exist_Channel_Request_Description: 0
Here's a quick status taken with gdb with breakpoint on gprs_rlcmac_dl_tbf_rcv_data_block() while already in the faulty state:
Breakpoint 2, gprs_rlcmac_dl_tbf_rcv_data_block (dl_tbf=dl_tbf@entry=0x613000014b60, rlc=rlc@entry=0x7ffff3662a20, data=0x6160019c5d9a "\a", fn=fn@entry=213356, ts_nr=ts_nr@entry=5 '\005') at /home/pespin/dev/sysmocom/git/libosmo-gprs/src/rlcmac/tbf_dl.c:266 266 { (gdb) n 270 const uint16_t ws = gprs_rlcmac_rlc_window_ws(dl_tbf->w); (gdb) n 272 LOGPTBFDL(dl_tbf, LOGL_DEBUG, "DL DATA TFI=%d received (V(Q)=%d .. V(R)=%d)\n", (gdb) print ws $1 = 64 (gdb) n 278 gprs_rlcmac_dl_tbf_t3190_start(dl_tbf); (gdb) print gprs_rlcmac_rlc_dl_window_v_q(dl_tbf->dlw) $2 = 0 (gdb) print gprs_rlcmac_rlc_dl_window_v_r(dl_tbf->dlw) $3 = 0 (gdb) n 281 for (block_idx = 0; block_idx < rlc->num_data_blocks; block_idx++) { (gdb) n 283 rdbi = &rlc->block_info[block_idx]; (gdb) n 285 LOGPTBFDL(dl_tbf, LOGL_DEBUG, (gdb) n 294 if (!gprs_rlcmac_rlc_dl_window_is_in_window(dl_tbf->dlw, rdbi->bsn)) { (gdb) print *rdbi $4 = {data_len = 20, bsn = 22, ti = 0, e = 1, cv = 255, pi = 0, spb = 0} (gdb) print gprs_rlcmac_rlc_dl_window_is_in_window(dl_tbf->dlw, rdbi->bsn) $5 = true (gdb) n 300 } else if (gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn)) { (gdb) print gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn) $6 = true (gdb) n 301 LOGPTBFDL(dl_tbf, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn); (gdb) print dl_tbf->dlw->v_n $7 = {v_n = {GPRS_RLCMAC_RLC_DL_BSN_RECEIVED <repeats 128 times>, GPRS_RLCMAC_RLC_DL_BSN_INVALID <repeats 896 times>}} (gdb) print *dl_tbf->dlw $9 = {window = {sns = 128, ws = 64}, dl_tbf = 0x613000014b60, v_r = 0, v_q = 0, v_n = {v_n = {GPRS_RLCMAC_RLC_DL_BSN_RECEIVED <repeats 128 times>, GPRS_RLCMAC_RLC_DL_BSN_INVALID <repeats 896 times>}}}
Related spec: TS 44.060 12.3 "Ack/Nack Description"
Files
Updated by pespin 9 months ago
I'm attaching the exact same pcap as above but now with the gsmtap_log included. The interesting part starts at frame_nr 343934 "GPRS UL CTRL: PACKET_DOWNLINK_ACK_NACK" which is the last DL_ACK/NACK before overflowing the BSN.
When reused BSN=0 is received:
346124 16:57:52.691720331 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 184 TBF(DL:NR-2:TLLI-e6020604) Rx new DL data 346125 16:57:52.691751339 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 211 TBF(DL:NR-2:TLLI-e6020604) DL DATA TFI=0 received (V(Q)=0 .. V(R)=0) 346126 16:57:52.691771458 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 245 TBF(DL:NR-2:TLLI-e6020604) Got CS-4 RLC data block: FBI=0, BSN=0, SPB=0, S/P=0 RRBP=0, E=1, bitoffs=24 346127 16:57:52.691780296 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 192 TBF(DL:NR-2:TLLI-e6020604) BSN 0 already received 346128 16:57:52.691792779 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 222 TBF(DL:NR-2:TLLI-e6020604) No gaps in received block, last block: BSN=127 FBI=0
During the initial BSN=0 (at the start of the DL TBF), the log says (frame_nr 22091) "BSN 0 storing in window (0..63) "
So it seems the DL TBF is not "releasing" old BSNs which were already received and pushed to upper layers and keeps marking them as received.
Updated by pespin 9 months ago
Seems the fix should be done somehow releasing the BSN=0 when previous BSN=127 is received, around lines:
345481 16:57:52.613529009 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 162 - Raising V(R) to 0 345483 16:57:52.613566161 127.0.0.1 38705 127.0.0.1 4729 GSMTAP 184 - Taking block 127 out, raising V(Q) to 0
I tried the following patch, and then the previous step seems to be overcome, but then it gets stuck again after reaching START_SEQUENCE=64 with all RRBPs to 1 this time:
diff --git a/src/rlcmac/rlc_window_dl.c b/src/rlcmac/rlc_window_dl.c index a2bef5f..ef2d689 100644 --- a/src/rlcmac/rlc_window_dl.c +++ b/src/rlcmac/rlc_window_dl.c @@ -225,6 +225,7 @@ void gprs_rlcmac_rlc_dl_window_raise_v_r(struct gprs_rlcmac_rlc_dl_window *dlw, gprs_rlcmac_rlc_dl_window_raise_v_r_to(dlw, 1); } LOGRLCMAC(LOGL_DEBUG, "- Raising V(R) to %d\n", gprs_rlcmac_rlc_dl_window_v_r(dlw)); + gprs_rlcmac_rlc_v_n_mark_missing(&dlw->v_n, gprs_rlcmac_rlc_dl_window_v_r(dlw)); } }
I think osmo-pcu also may have a similar bug on the UL, but it's harder to spot there since the UL TBF cannot be kept open forever as easily as DL TBF and hence it may close in general before submitting 128 BSNs?
Updated by pespin 9 months ago
- Status changed from New to In Progress
I encountered this bug I introduced while porting code from osmo-pcu:
https://gerrit.osmocom.org/c/libosmo-gprs/+/33885
Updated by pespin 9 months ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 50
Aside from the porting bug I introduced, there seems to be an extra problem I spotted, and which is fixed here:
https://gerrit.osmocom.org/c/libosmo-gprs/+/33886 rlcmac: Mark received BSNs falling out of the V(N)/RBB when V(R) is raised
I so far believe this bug is also present in osmo-pcu, and we should write a TTCN-3 test to validate whether the issue indeed exists, then apply a similar patch in osmo-pcu ul_window.
The test would have to submit more than 128 UL data blocks and see if osmo-pcu thinks BSN=0 is already received and keeps asking for it in PKT UL ACK/NACK.
Updated by pespin 9 months ago
I started writting a TTCN3 PCU_Tests test to validate whether wraparound was going correct or there was a bug, and it seems the wraparound is wroking fine.
Further investigation found out that in the condition which was reoslving the BSN in libosmo-gprs-rlcmac ""BSN 0 already received" is slightly different in original osmo-pcu code and in the ported code (probably an overlook on my side when porting).
Original osmo-pcu code is using in gprs_rlcmac_ul_tbf::rcv_data_block_acknowledged():
else if (m_window.is_received(rdbi->bsn)) { LOGPTBFUL(this, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn); continue; }
While libosmo-gprs-rlcmac code is using in gprs_rlcmac_dl_tbf_rcv_data_block():
else if (gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn)) { LOGPTBFDL(dl_tbf, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn); continue; }
But it should actually be calling gprs_rlcmac_rlc_dl_window_is_received() instead, which applies several more restrictions before returning null:
bool gprs_rlcmac_rlc_dl_window_is_received(const struct gprs_rlcmac_rlc_dl_window *dlw, uint16_t bsn) { const struct gprs_rlcmac_rlc_window *w = rlc_dlw_as_w_const(dlw); uint16_t offset_v_r; /* Offset to the end of the received window */ offset_v_r = (dlw->v_r - 1 - bsn) & gprs_rlcmac_rlc_window_mod_sns(w); return gprs_rlcmac_rlc_dl_window_is_in_window(dlw, bsn) && gprs_rlcmac_rlc_v_n_is_received(&dlw->v_n, bsn) && offset_v_r < gprs_rlcmac_rlc_window_ws(w); }
The first condition "gprs_rlcmac_rlc_dl_window_is_in_window(dlw, bsn)" we don't really case because it's the same condition driving us to this code path, so we know gprs_rlcmac_rlc_dl_window_is_in_window() is always true, so no need to call it. The second condition "gprs_rlcmac_rlc_v_n_is_received(&dlw->v_n, bsn)" is the one we are already using in libosmo-gprs-rlcmac. The third condition "offset_v_r < gprs_rlcmac_rlc_window_ws(w)" was the one taking care of not resolving the BSN as received in this wraparound case, since v_r=0, bsn=0: "(0 - 1 - 0) & 127" ----> 0xffff & 0x007f ---> 0x007f > 64.
However, I'm so far becoming to like more the fix I implemented for libosmo-gprs-rlcmac in https://gerrit.osmocom.org/c/libosmo-gprs/+/33886, since it keeps the good state in the V(N) array instead of doing heuristics to find out if a BSN is received after having gone through the first BSN wraparound. So I may actually apply the same fix in osmo-pcu, and change the following:
diff --git a/src/rlc.cpp b/src/rlc.cpp index a0825976..8b6b027d 100644 --- a/src/rlc.cpp +++ b/src/rlc.cpp @@ -268,8 +268,14 @@ void gprs_rlc_ul_window::raise_v_r(const uint16_t bsn) /* Positive offset, so raise. */ if (offset_v_r < (sns() >> 1)) { while (offset_v_r--) { - if (offset_v_r) /* all except the received block */ - m_v_n.mark_missing(v_r()); + const uint16_t _v_r = v_r(); + const uint16_t bsn_no_longer_in_ws = mod_sns(_v_r - ws()); + //LOGP(DRLCMACUL, LOGL_DEBUG, "- Mark BSN %u as INVALID\n", bsn_no_longer_in_ws); + m_v_n.mark_invalid(bsn_no_longer_in_ws); + if (offset_v_r) {/* all except the received block */ + //LOGP(DRLCMACUL, LOGL_DEBUG, "- Mark BSN %u as MISSING\n", _v_r); + m_v_n.mark_missing(_v_r); + } raise_v_r_to(1); } LOGP(DRLCMACUL, LOGL_DEBUG, "- Raising V(R) to %d\n", v_r()); diff --git a/src/rlc.h b/src/rlc.h index d4a7a5e1..0a297336 100644 --- a/src/rlc.h +++ b/src/rlc.h @@ -329,6 +329,7 @@ struct gprs_rlc_v_n { void mark_received(int bsn); void mark_missing(int bsn); + void mark_invalid(int bsn); bool is_received(int bsn) const; @@ -649,6 +650,11 @@ inline void gprs_rlc_v_n::mark_missing(int bsn) return mark(bsn, GPRS_RLC_UL_BSN_MISSING); } +inline void gprs_rlc_v_n::mark_invalid(int bsn) +{ + return mark(bsn, GPRS_RLC_UL_BSN_INVALID); +} + inline bool gprs_rlc_v_n::is_received(int bsn) const { return is_state(bsn, GPRS_RLC_UL_BSN_RECEIVED); diff --git a/src/tbf_ul.cpp b/src/tbf_ul.cpp index 5ae0d495..3746130f 100644 --- a/src/tbf_ul.cpp +++ b/src/tbf_ul.cpp @@ -303,7 +303,7 @@ int gprs_rlcmac_ul_tbf::rcv_data_block_acknowledged( rdbi->bsn, m_window.v_q(), m_window.mod_sns(m_window.v_q() + ws - 1)); continue; - } else if (m_window.is_received(rdbi->bsn)) { + } else if (m_window.m_v_n.is_received(rdbi->bsn)) { LOGPTBFUL(this, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn); continue;
Updated by pespin 9 months ago
- % Done changed from 50 to 90
osmo-pcu port of libosmo-gprs-rlcmac patch:
remote: https://gerrit.osmocom.org/c/osmo-pcu/+/33922 gprs_rlc_ul_window: Make sure V(N) array is cleared during reset_state()
remote: https://gerrit.osmocom.org/c/osmo-pcu/+/33924 gprs_rlc_ul_window: Mark received BSNs falling out of the V(N)/RBB when V(R) ...
ttcn3 test:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/33923 pcu: Introduce test TC_ul_tbf_bsn_wraparound_gprs