Project

General

Profile

Actions

Bug #6102

closed

Feature #5500: MS-Side GPRS RLC/MAC implementation

libosmo-gprs-rlcmac: DL ACK/NACK endless loop again PCU after 128 bsns received

Added by pespin 9 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
07/20/2023
Due date:
% Done:

100%

Resolution:
Spec Reference:
TS 44.060

Description

While testing the "modem" app against osmo-pcu, ping-pong works fine until PCU submits an entire round of 128 BSNs (SNS=128). At that point in time, when PCU starts reusing BSN=0 to sends new data, the libosmo-gprs-rlcmac implementation becomes stuck always answering the DL_ACK/NACK containing STARTING_SEQUENCE_NUMBER=0 and an RRB full of zeroes.

As a result, the PCU keeps resending the same blocks endlessly. I attach a pcap showing the problem. By filtering for DL_ACK/NACK message, once can see the full loop BSN=0..BSN=127, and how the problem starts when PCU starts reusing BSN=0 again.

    Ack_Nack_Description
        ...0 .... = FINAL_ACK_INDICATION: False
        .... 0000  000. .... = STARTING_SEQUENCE_NUMBER: 0
        ...0 0000  0000 0000  0000 0000  0000 0000  0000 0000  0000 0000  0000 0000  0000 0000  000. .... = Received Block Bitmap: 0
    ...0 .... = Exist_Channel_Request_Description: 0

Here's a quick status taken with gdb with breakpoint on gprs_rlcmac_dl_tbf_rcv_data_block() while already in the faulty state:

Breakpoint 2, gprs_rlcmac_dl_tbf_rcv_data_block (dl_tbf=dl_tbf@entry=0x613000014b60, rlc=rlc@entry=0x7ffff3662a20, data=0x6160019c5d9a "\a",
    fn=fn@entry=213356, ts_nr=ts_nr@entry=5 '\005') at /home/pespin/dev/sysmocom/git/libosmo-gprs/src/rlcmac/tbf_dl.c:266
266    {
(gdb) n
270            const uint16_t ws = gprs_rlcmac_rlc_window_ws(dl_tbf->w);
(gdb) n
272            LOGPTBFDL(dl_tbf, LOGL_DEBUG, "DL DATA TFI=%d received (V(Q)=%d .. V(R)=%d)\n",
(gdb) print ws
$1 = 64
(gdb) n
278            gprs_rlcmac_dl_tbf_t3190_start(dl_tbf);
(gdb) print gprs_rlcmac_rlc_dl_window_v_q(dl_tbf->dlw)
$2 = 0
(gdb) print gprs_rlcmac_rlc_dl_window_v_r(dl_tbf->dlw)
$3 = 0
(gdb) n
281            for (block_idx = 0; block_idx < rlc->num_data_blocks; block_idx++) {
(gdb) n
283                        rdbi = &rlc->block_info[block_idx];
(gdb) n
285                        LOGPTBFDL(dl_tbf, LOGL_DEBUG,
(gdb) n
294                        if (!gprs_rlcmac_rlc_dl_window_is_in_window(dl_tbf->dlw, rdbi->bsn)) {
(gdb) print *rdbi
$4 = {data_len = 20, bsn = 22, ti = 0, e = 1, cv = 255, pi = 0, spb = 0}
(gdb) print gprs_rlcmac_rlc_dl_window_is_in_window(dl_tbf->dlw, rdbi->bsn)
$5 = true
(gdb) n
300                        } else if (gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn)) {
(gdb) print gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn)
$6 = true
(gdb) n
301                                LOGPTBFDL(dl_tbf, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn);
(gdb) print dl_tbf->dlw->v_n
$7 = {v_n = {GPRS_RLCMAC_RLC_DL_BSN_RECEIVED <repeats 128 times>, GPRS_RLCMAC_RLC_DL_BSN_INVALID <repeats 896 times>}}
(gdb) print *dl_tbf->dlw
$9 = {window = {sns = 128, ws = 64}, dl_tbf = 0x613000014b60, v_r = 0, v_q = 0, v_n = {v_n = {GPRS_RLCMAC_RLC_DL_BSN_RECEIVED <repeats 128 times>,
      GPRS_RLCMAC_RLC_DL_BSN_INVALID <repeats 896 times>}}}

Related spec: TS 44.060 12.3 "Ack/Nack Description"


Files

Actions #1

Updated by pespin 9 months ago

I'm attaching the exact same pcap as above but now with the gsmtap_log included. The interesting part starts at frame_nr 343934 "GPRS UL CTRL: PACKET_DOWNLINK_ACK_NACK" which is the last DL_ACK/NACK before overflowing the BSN.

When reused BSN=0 is received:

346124    16:57:52.691720331    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    184    TBF(DL:NR-2:TLLI-e6020604) Rx new DL data 
346125    16:57:52.691751339    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    211    TBF(DL:NR-2:TLLI-e6020604) DL DATA TFI=0 received (V(Q)=0 .. V(R)=0) 
346126    16:57:52.691771458    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    245    TBF(DL:NR-2:TLLI-e6020604) Got CS-4 RLC data block: FBI=0, BSN=0, SPB=0, S/P=0 RRBP=0, E=1, bitoffs=24 
346127    16:57:52.691780296    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    192    TBF(DL:NR-2:TLLI-e6020604) BSN 0 already received 
346128    16:57:52.691792779    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    222    TBF(DL:NR-2:TLLI-e6020604) No gaps in received block, last block: BSN=127 FBI=0 

During the initial BSN=0 (at the start of the DL TBF), the log says (frame_nr 22091) "BSN 0 storing in window (0..63) "

So it seems the DL TBF is not "releasing" old BSNs which were already received and pushed to upper layers and keeps marking them as received.

Actions #2

Updated by pespin 9 months ago

Seems the fix should be done somehow releasing the BSN=0 when previous BSN=127 is received, around lines:

345481    16:57:52.613529009    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    162    - Raising V(R) to 0 
345483    16:57:52.613566161    127.0.0.1    38705    127.0.0.1    4729    GSMTAP    184    - Taking block 127 out, raising V(Q) to 0 

I tried the following patch, and then the previous step seems to be overcome, but then it gets stuck again after reaching START_SEQUENCE=64 with all RRBPs to 1 this time:

diff --git a/src/rlcmac/rlc_window_dl.c b/src/rlcmac/rlc_window_dl.c
index a2bef5f..ef2d689 100644
--- a/src/rlcmac/rlc_window_dl.c
+++ b/src/rlcmac/rlc_window_dl.c
@@ -225,6 +225,7 @@ void gprs_rlcmac_rlc_dl_window_raise_v_r(struct gprs_rlcmac_rlc_dl_window *dlw,
                        gprs_rlcmac_rlc_dl_window_raise_v_r_to(dlw, 1);
                }
                LOGRLCMAC(LOGL_DEBUG, "- Raising V(R) to %d\n", gprs_rlcmac_rlc_dl_window_v_r(dlw));
+               gprs_rlcmac_rlc_v_n_mark_missing(&dlw->v_n, gprs_rlcmac_rlc_dl_window_v_r(dlw));
        }
 }

I think osmo-pcu also may have a similar bug on the UL, but it's harder to spot there since the UL TBF cannot be kept open forever as easily as DL TBF and hence it may close in general before submitting 128 BSNs?

Actions #3

Updated by pespin 9 months ago

  • Status changed from New to In Progress

I encountered this bug I introduced while porting code from osmo-pcu:
https://gerrit.osmocom.org/c/libosmo-gprs/+/33885

Actions #4

Updated by pespin 9 months ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 50

Aside from the porting bug I introduced, there seems to be an extra problem I spotted, and which is fixed here:
https://gerrit.osmocom.org/c/libosmo-gprs/+/33886 rlcmac: Mark received BSNs falling out of the V(N)/RBB when V(R) is raised

I so far believe this bug is also present in osmo-pcu, and we should write a TTCN-3 test to validate whether the issue indeed exists, then apply a similar patch in osmo-pcu ul_window.

The test would have to submit more than 128 UL data blocks and see if osmo-pcu thinks BSN=0 is already received and keeps asking for it in PKT UL ACK/NACK.

Actions #5

Updated by pespin 9 months ago

I started writting a TTCN3 PCU_Tests test to validate whether wraparound was going correct or there was a bug, and it seems the wraparound is wroking fine.

Further investigation found out that in the condition which was reoslving the BSN in libosmo-gprs-rlcmac ""BSN 0 already received" is slightly different in original osmo-pcu code and in the ported code (probably an overlook on my side when porting).

Original osmo-pcu code is using in gprs_rlcmac_ul_tbf::rcv_data_block_acknowledged():

else if (m_window.is_received(rdbi->bsn)) {
            LOGPTBFUL(this, LOGL_DEBUG,
                  "BSN %d already received\n", rdbi->bsn);
            continue;
}

While libosmo-gprs-rlcmac code is using in gprs_rlcmac_dl_tbf_rcv_data_block():

else if (gprs_rlcmac_rlc_v_n_is_received(&dl_tbf->dlw->v_n, rdbi->bsn)) {
            LOGPTBFDL(dl_tbf, LOGL_DEBUG, "BSN %d already received\n", rdbi->bsn);
            continue;
}

But it should actually be calling gprs_rlcmac_rlc_dl_window_is_received() instead, which applies several more restrictions before returning null:

bool gprs_rlcmac_rlc_dl_window_is_received(const struct gprs_rlcmac_rlc_dl_window *dlw, uint16_t bsn)
{
    const struct gprs_rlcmac_rlc_window *w = rlc_dlw_as_w_const(dlw);
    uint16_t offset_v_r;

    /* Offset to the end of the received window */
    offset_v_r = (dlw->v_r - 1 - bsn) & gprs_rlcmac_rlc_window_mod_sns(w);
    return gprs_rlcmac_rlc_dl_window_is_in_window(dlw, bsn) &&
           gprs_rlcmac_rlc_v_n_is_received(&dlw->v_n, bsn) &&
           offset_v_r < gprs_rlcmac_rlc_window_ws(w);
}

The first condition "gprs_rlcmac_rlc_dl_window_is_in_window(dlw, bsn)" we don't really case because it's the same condition driving us to this code path, so we know gprs_rlcmac_rlc_dl_window_is_in_window() is always true, so no need to call it. The second condition "gprs_rlcmac_rlc_v_n_is_received(&dlw->v_n, bsn)" is the one we are already using in libosmo-gprs-rlcmac. The third condition "offset_v_r < gprs_rlcmac_rlc_window_ws(w)" was the one taking care of not resolving the BSN as received in this wraparound case, since v_r=0, bsn=0: "(0 - 1 - 0) & 127" ----> 0xffff & 0x007f ---> 0x007f > 64.

However, I'm so far becoming to like more the fix I implemented for libosmo-gprs-rlcmac in https://gerrit.osmocom.org/c/libosmo-gprs/+/33886, since it keeps the good state in the V(N) array instead of doing heuristics to find out if a BSN is received after having gone through the first BSN wraparound. So I may actually apply the same fix in osmo-pcu, and change the following:

diff --git a/src/rlc.cpp b/src/rlc.cpp
index a0825976..8b6b027d 100644
--- a/src/rlc.cpp
+++ b/src/rlc.cpp
@@ -268,8 +268,14 @@ void gprs_rlc_ul_window::raise_v_r(const uint16_t bsn)
        /* Positive offset, so raise. */
        if (offset_v_r < (sns() >> 1)) {
                while (offset_v_r--) {
-                       if (offset_v_r) /* all except the received block */
-                               m_v_n.mark_missing(v_r());
+                       const uint16_t _v_r = v_r();
+                       const uint16_t bsn_no_longer_in_ws = mod_sns(_v_r - ws());
+                       //LOGP(DRLCMACUL, LOGL_DEBUG, "- Mark BSN %u as INVALID\n", bsn_no_longer_in_ws);
+                       m_v_n.mark_invalid(bsn_no_longer_in_ws);
+                       if (offset_v_r) {/* all except the received block */
+                               //LOGP(DRLCMACUL, LOGL_DEBUG, "- Mark BSN %u as MISSING\n", _v_r);
+                               m_v_n.mark_missing(_v_r);
+                       }
                        raise_v_r_to(1);
                }
                LOGP(DRLCMACUL, LOGL_DEBUG, "- Raising V(R) to %d\n", v_r());
diff --git a/src/rlc.h b/src/rlc.h
index d4a7a5e1..0a297336 100644
--- a/src/rlc.h
+++ b/src/rlc.h
@@ -329,6 +329,7 @@ struct gprs_rlc_v_n {

        void mark_received(int bsn);
        void mark_missing(int bsn);
+       void mark_invalid(int bsn);

        bool is_received(int bsn) const;

@@ -649,6 +650,11 @@ inline void gprs_rlc_v_n::mark_missing(int bsn)
        return mark(bsn, GPRS_RLC_UL_BSN_MISSING);
 }

+inline void gprs_rlc_v_n::mark_invalid(int bsn)
+{
+       return mark(bsn, GPRS_RLC_UL_BSN_INVALID);
+}
+
 inline bool gprs_rlc_v_n::is_received(int bsn) const
 {
        return is_state(bsn, GPRS_RLC_UL_BSN_RECEIVED);
diff --git a/src/tbf_ul.cpp b/src/tbf_ul.cpp
index 5ae0d495..3746130f 100644
--- a/src/tbf_ul.cpp
+++ b/src/tbf_ul.cpp
@@ -303,7 +303,7 @@ int gprs_rlcmac_ul_tbf::rcv_data_block_acknowledged(
                                  rdbi->bsn,
                                  m_window.v_q(), m_window.mod_sns(m_window.v_q() + ws - 1));
                        continue;
-               } else if (m_window.is_received(rdbi->bsn)) {
+               } else if (m_window.m_v_n.is_received(rdbi->bsn)) {
                        LOGPTBFUL(this, LOGL_DEBUG,
                                  "BSN %d already received\n", rdbi->bsn);
                        continue;

Actions #6

Updated by pespin 9 months ago

  • % Done changed from 50 to 90

osmo-pcu port of libosmo-gprs-rlcmac patch:
remote: https://gerrit.osmocom.org/c/osmo-pcu/+/33922 gprs_rlc_ul_window: Make sure V(N) array is cleared during reset_state()
remote: https://gerrit.osmocom.org/c/osmo-pcu/+/33924 gprs_rlc_ul_window: Mark received BSNs falling out of the V(N)/RBB when V(R) ...

ttcn3 test:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/33923 pcu: Introduce test TC_ul_tbf_bsn_wraparound_gprs

Actions #7

Updated by pespin 9 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 90 to 100

Fixed & merged, closing.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)