Project

General

Profile

Bug #4974

gbproxy-ttcn3-test over framerelay are unstable

Added by daniel 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
libosmogb
Target version:
-
Start date:
01/25/2021
Due date:
% Done:

100%

Spec Reference:

Description

Even though it appears in gbproxy I think this is NS2-related.

Lots of tests started failing randomly around Job #92, see:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-gbproxy-test-fr/test_results_analyzer/

The non-fr tests are quite stable so this has something to do with either NS2 or the ttcn3 implementation of NS over FR.

The failure is usually some timeout waiting for BSSGP on PCU side

Timeout waiting for BSSGP on PCU side: { pDU_BSSGP_DL_UNITDATA := { bssgpPduType := '00'O, tLLI_current := 'C2180023'O, qoS_Profile := ?, pDU_Lifetime := ?, mS_Radio_Access_Capability := *, priority := *, dRX_Parameters := *, iMSI := { iEI := '0D'O ("\r"), ext := '1'B, lengthIndicator := ?, type_of_Identity := '001'B, oddevenIndicator := ?, digits := '262420000000000'H }, tLLI_old := *, pFI := *, lSA_Information := *, service_UTRAN_CCO := *, service_Class_Indicator := *, subscriber_Profile_ID_For_RAT_Priority := *, redirection_Indication := *, redirection_Completed := *, unconfirmed_Send_State_Variable := *, sCI := *, gGSN_PGW_Location := *, eDRX_Paremeters := *, old_Routing_Area_Identification := *, attach_Indicator := *, alignment_octets := *, lLC_PDU := { iEI := '0E'O, ext := ?, lengthIndicator := ?, lLC_PDU := 'DAA1113E56FAE9C68FA81DA1'O }, initialLLC_PDU := * } }
      GBProxy_Tests.ttcn:3297 GBProxy_Tests control part
      GBProxy_Tests.ttcn:1141 TC_dl_unitdata testcase
[...]
Timeout waiting for BSSGP on PCU side: { pDU_BSSGP_FLOW_CONTROL_MS_ACK := { bssgpPduType := '29'O (")"), tLLI := { iEI := '1F'O, ext := '1'B, lengthIndicator := { length1 := 4 }, tLLI_Value := 'C220040B'O }, tag := { iEI := '1E'O, ext := '1'B, lengthIndicator := { length1 := 1 }, unstructured_Value := '12'O } } }
      GBProxy_Tests.ttcn:3374 GBProxy_Tests control part
      GBProxy_Tests.ttcn:2775 TC_fc_ms testcase

gbproxy-fr-bvci_unknown.pcapng gbproxy-fr-bvci_unknown.pcapng 13.9 KB laforge, 02/12/2021 03:38 PM

Related issues

Related to libosmocore - Bug #4995: handle ENOBUFS on write to AF_PACKET socketResolved01/30/2021

Related to libosmocore - Bug #5023: race condition in NS over IP sub-networkResolved02/13/2021

Associated revisions

Revision 8373c05a (diff)
Added by laforge 3 months ago

gb: frame_relay: Detect link outage on "last receive seq nr == 0"

this is a bit of a hack. Q.933 explicitly forbids either side from ever
sending a sequence number of '0'. Values start from '1' and are modulo 256,
but '0' is always skipped. So if the peer is sending us a "last received
sequence number of '0' it means it has not yet received any packets from us,
which in turn can only mean that it has just been restarted. Let's treat
this as "service affecting condition" and notify upper layers. This helps
particularly in recovering from rapidly re-starting peers, where the Q.933
nor NS have time to actually detect the connection was lost.

Change-Id: I960a7b17f2550cb49a7b9d72ed87cd271bb64122
Related: OS#4974

Revision 09422686 (diff)
Added by laforge 3 months ago

BVC FSM: Treat overlapping BVC-RESET as implicit ACK

If the BSS (or SGSN) has sent a BVC-RESET PDU for a BVCI to the SGSN (or
BSS) and is awaiting a BVC-RESET- ACK PDU in response, but instead
receives a BVC-RESET PDU indicating the same BVCI, then this shall be
interpreted as a BVC-RESET ACK PDU and the T2 timer shall be stopped.

Related: OS#4974
Change-Id: I4d15733f9f205cb563b66ef9e41dc8df50151900

History

#1 Updated by laforge 4 months ago

Thnaks for keeping a look at it and creating the ticket, Daniel. I've
also seen that the last four builds were particularly bad in terms of
test results.

Maybe it would be some idea to start with a lower-level test of some sort?

Like some small C-language test porograms that uses libosmocore NS2 in
both SGSN and PCU side and which then test only that part,
including e.g. the load sharing over multiple NS-VC.

With the debian10-fr VM image everyone should be able to also execute
that locally without any requirement to fsck up their personal laptop
or with relying on jenkins.

For the TTCN3 FR code base we do have the fr + frnet TTCN3 tests that
serve a similar pusrpose: To self-test the TTCN3 FR test port and the NS
emulation. I will set-up two jenkins jobs to automatically execute
them:
  1. with one hdlc-netdev (no load sharing)
  2. with four hdlc-netdev (load sharing like we do in GBProxy_Tests.ttcn)

Once those are set-up, I will run them a couple of times on the jenkins
slave and see. If those run stable, the problem is likely either on the
"C" language side NS2, or in the GBProxy_Test itself.

Leaving this ticket assigned to lynxis for the libosmogb / ns2 related
work.

#2 Updated by daniel 4 months ago

#3 Updated by laforge 3 months ago

  • Related to Bug #4995: handle ENOBUFS on write to AF_PACKET socket added

#4 Updated by laforge 3 months ago

Update from my side:
  • I've fixed the ENOBUFS issue on the TTCN3 side
  • FR_Tests has been adjusted to actually check results and terminate
  • I have been unable to get the TTCN3 NS/BSSGP implementation to work reliably under high load. The problem appears to be that messages from NS_Emulation are not processed (fast enough?) by BSSGP_Emulation, which basically causes NS_Emulation to allocate more and more buffers for its output port and eventually give up.
  • I switched to some load testing on the C side with load generation and mirroring in osmo-ns-dummy
    • this resulted in quite a lengthy patch seires (now merged) to libosmocore

I will now get back to wokring on the TTCN3 FR_Tests / FRNET_Tests. Maybe first get them running and automatically executed by jenkins at low load (like GBProxy_Tests being on very low load), to see if they are more stable or not.

#5 Updated by laforge 3 months ago

just an update on the osmo-ns-dummy related load teseting:
  • I tested with one FR/DLC/NS-VC and with 4x E1/FR/DLC/NS-VC
    • load sharing works as expected when using randomized LSP
    • packets start to drop only for non-signalling traffic, as expected

#6 Updated by lynxis 3 months ago

  • Assignee changed from lynxis to laforge

#7 Updated by laforge 3 months ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

lynxis I still see the responsibility for testing C-against-C for stability on your side.

I've now finally managed to get TTCN3-against-TTCN3 (FR_Tests against FRNET_Tests) completed. The resulting jenkins job is at https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-fr-test/ - so far it always runs reliably - NS and all 10 BVC come up as expected, and (at least) uplink-unitdata from 100 simulated UE is flowing.

I still have to extend the tests to
  • do bi-directional testing
  • make sure the FRNET_Tests junit xml doesn't confuse the results
  • use multiple E1/FR interfaces in parallel within one NSE
  • use multiple NSE in parallel (as they have no relation I doubt it changes anything, though)

Both in the automatic execution in jenkins (5 builts so far) and in my many manuals runs today I have not seen any sporadic failures. So I think at least TTCN3-to-TTCN3 with 1 NSVC in 1 NSE is a stable base line.

#8 Updated by laforge 3 months ago

Ok, so https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-fr-test/test_results_analyzer/ is running quite stable now, alrady 22 builds without any failures.

I've locally seen some sporadic failures between fr and fr-net, whihc are explained as follows:
  • FRNET keeps running (simulationg the IUT)
    • states is kept across TC executions
  • FR keeps re-spawning again and again, once for each test case
    • all state is clean at start of every TC
Now what happens in some rare cases is
  1. FRNET did not yet realize that all of the NS-VCS are dead (respective NS timers not timed out yet), so it thinks all NS-VC are alive still
  2. FR resets one NS-VC and then sends the BVC-RESET on top of it
  3. FRNET responds with a BVC-RESET-ACK, but sends it ove a different NS-VC
    • only FRNET thinks that NS-VC is still alive
    • FR considers this still as DEAD
    • -> boom.
So we either have
  • reduce the NS alive/test timers to a very short amount, or
  • wait sufficiently long before executing the next test case, so we're sure all timeouts have happened
  • somehow control FRNET to force all state to reset
  • reset all NS state when FR detects link outage
    • but then we have the same problem with LMI: If we re-spawn faster than LMI detects an outage, ...

#9 Updated by laforge 3 months ago

I've now switched to testing a loop of FR_Tests.TC_bvc_bringup against osmo-gbproxy. It works only exactly once.

The reasons are probably quite similar: osmo-gbproxy does not detect a link outage between the successive test executions, both Q.933 and NS assume the NS-VCs are in good shape, while the TTCN3 side assumes everything is dead+blocked when starting up.

If one waits long enough to see

<0026> frame_relay.c:430 hdlcnet1: Link failed
<0026> frame_relay.c:199 hdlcnet1: DLCI 201 became inactive
<0026> frame_relay.c:430 hdlcnet2: Link failed
<0026> frame_relay.c:199 hdlcnet2: DLCI 202 became inactive
<0026> frame_relay.c:430 hdlcnet3: Link failed
<0026> frame_relay.c:199 hdlcnet3: DLCI 203 became inactive
<0026> frame_relay.c:430 hdlcnet4: Link failed
<0026> frame_relay.c:199 hdlcnet4: DLCI 204 became inactive

in the logs, then the next attempt succesds again.

We could add the same "hack" of treating the Q.933 seuqence number '0' as forced service-affecting-condition. I'll see what can be done.

#11 Updated by laforge 3 months ago

even after the "force-detect-link-outage" changes described above, I can still generate plenty of failed test results.

One example is TC_flush_ll, where the TTCN3 side suddenly sends a NS-RESET while the test is running:

Initially it works fine: Everything comes up and flush-ll + ack go back and forth

<0004> gb_proxy.c:1154 NSE(00102/SGSN)-BVC(00000/??) Rx FLUSH-LL
<0004> gb_proxy.c:192 NSE(00102/SGSN)-BVC(00000/??) proxying to NSE(02001/BSS)
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:907 NSE(02001/BSS)-BVC(00000/??) Rx FLUSH-LL-ACK
<0004> gb_proxy.c:192 NSE(02001/BSS)-BVC(00000/??) proxying to NSE(00102/SGSN)
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(UDP-NSE00102-remote-127_0_0_11:8888)[0x612000010720]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:1154 NSE(00102/SGSN)-BVC(00000/??) Rx FLUSH-LL
<0004> gb_proxy.c:192 NSE(00102/SGSN)-BVC(00000/??) proxying to NSE(02001/BSS)
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:907 NSE(02001/BSS)-BVC(00000/??) Rx FLUSH-LL-ACK
<0004> gb_proxy.c:192 NSE(02001/BSS)-BVC(00000/??) proxying to NSE(00102/SGSN)
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(UDP-NSE00102-remote-127_0_0_11:8888)[0x612000010720]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:1154 NSE(00102/SGSN)-BVC(00000/??) Rx FLUSH-LL
<0004> gb_proxy.c:192 NSE(00102/SGSN)-BVC(00000/??) proxying to NSE(02001/BSS)
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:907 NSE(02001/BSS)-BVC(00000/??) Rx FLUSH-LL-ACK
<0004> gb_proxy.c:192 NSE(02001/BSS)-BVC(00000/??) proxying to NSE(00102/SGSN)
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(UDP-NSE00102-remote-127_0_0_11:8888)[0x612000010720]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:1154 NSE(00102/SGSN)-BVC(00000/??) Rx FLUSH-LL
<0004> gb_proxy.c:192 NSE(00102/SGSN)-BVC(00000/??) proxying to NSE(02001/BSS)
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: Received Event UNITDATA

Suddenly, a wild NS-RESET from the TTCN3 tester appears

<0004> gb_proxy.c:907 NSE(02001/BSS)-BVC(00000/??) Rx FLUSH-LL-ACK
<0004> gb_proxy.c:192 NSE(02001/BSS)-BVC(00000/??) proxying to NSE(00102/SGSN)
<0026> gprs_ns2_vc_fsm.c:808 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: Received Event RESET
<0026> gprs_ns2_vc_fsm.c:592 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{UNBLOCKED}: State change to RESET (T2, 3s)
<0026> gprs_ns2.c:520 NSE(02001) NS-STATUS.ind(bvci=00000): cause=NSE failure, transfer=0, first=0
<0004> gb_proxy.c:1423 NS-NSE(02001) became unavailable
<0026> gprs_ns2_message.c:329 NSE(02001)-NSVC(00001) Tx NS RESET ACK
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:296 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{RESET}: State change to BLOCKED (no timeout)
<0026> gprs_ns2_message.c:347 NSE(02001)-NSVC(00001) Tx NS UNBLOCK
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_message.c:375 NSE(02001)-NSVC(00001) Tx NS ALIVE
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message

FLUSH-LL is received but cannot be sent on BSS-side NS-VC?

... instead we see NS-ALIVE + UNBLOCK procedures

<0026> gprs_ns2_vc_fsm.c:833 GPRS-NS2-VC(UDP-NSE00102-remote-127_0_0_11:8888)[0x612000010720]{UNBLOCKED}: Received Event UNITDATA
<0004> gb_proxy.c:1154 NSE(00102/SGSN)-BVC(00000/??) Rx FLUSH-LL
<0004> gb_proxy.c:192 NSE(00102/SGSN)-BVC(00000/??) proxying to NSE(02001/BSS)
<0026> gprs_ns2_vc_fsm.c:826 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{BLOCKED}: Received Event ALIVE
<0026> gprs_ns2_message.c:387 NSE(02001)-NSVC(00001) Tx NS ALIVE_ACK
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:820 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{BLOCKED}: Received Event UNBLOCK
<0026> gprs_ns2_message.c:363 NSE(02001)-NSVC(00001) Tx NS UNBLOCK_ACK
<0026> frame_relay.c:846 hdlcnet1: DLCI 16 is active, sending message
<0026> gprs_ns2_vc_fsm.c:358 GPRS-NS2-VC(FR-hdlcnet1-DLCI16-NSE02001-NSVC00001)[0x6120000024a0]{BLOCKED}: State change to UNBLOCKED (no timeout)
<0026> gprs_ns2.c:520 NSE(02001) NS-STATUS.ind(bvci=00000): cause=NSE recovery, transfer=2, first=0
<0004> gb_proxy.c:1389 NS-NSE(02001) became available

so the interesting questions are:

  1. why does the tester send the NS-RESET
  2. why does gbproxy drop the FLUSH-LL? what if there are other NS-VC still alive/unblocked?

#12 Updated by laforge 3 months ago

Ok, so, looking at the TITAN log of the above test case:

13:42:14.375314 31 NS_Provider_FR.ttcn:79 Receive operation on port FR succeeded, message from GbProxy_Test-NS(PCU[0])-NSVCI1-NSVCemu1-FRemu(32): @FrameRelay_Emulation.FRemu_Event : {
    pvc_status := {
        dlci := 16,
        new := false,
        delete := false,
        active := true
    }
} id 26
13:42:14.375336 31 NS_Provider_FR.ttcn:79 Message with id 26 was extracted from the queue of FR.
13:42:14.375734 31 NS_Provider_FR.ttcn:81 Sent on NSE to GbProxy_Test-NS(PCU[0])-NSVCI1(29) @NS_Emulation.NS_Provider_Evt : {
    link_status := NS_PROV_LINK_STATUS_UP (0)
}
13:42:14.376453 29 NS_Emulation.ttcnpp:871 Message enqueued on NSCP from GbProxy_Test-NS(PCU[0])-NSVCI1-NSVCemu1-provFR(31) @NS_Emulation.NS_Provider_Evt : {
    link_status := NS_PROV_LINK_STATUS_UP (0)
} id 25
13:42:14.377393 29 NS_Emulation.ttcnpp:605 Matching on port NSCP NS_PROV_LINK_STATUS_UP (0) with NS_PROV_LINK_STATUS_DOWN (1) unmatched: First message in the queue does not match the template: 
13:42:14.377469 29 NS_Emulation.ttcnpp:622 Timeout operation on timer Tns_alive failed: The timer is not started.
13:42:14.377492 29 NS_Emulation.ttcnpp:641 Matching on port NSCP failed: Type of the first message in the queue is not @NS_Types.PDU_NS.
13:42:14.377532 29 NS_Emulation.ttcnpp:646 Matching on port NSCP failed: Type of the first message in the queue is not @NS_Types.PDU_NS.
13:42:14.377595 29 NS_Emulation.ttcnpp:692 Matching on port NSCP succeeded: .pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.pDU_NS_Unitdata.link_status matched
13:42:14.377624 29 NS_Emulation.ttcnpp:692 Receive operation on port NSCP succeeded, message from GbProxy_Test-NS(PCU[0])-NSVCI1-NSVCemu1-provFR(31): @NS_Emulation.NS_Provider_Evt : {
    link_status := NS_PROV_LINK_STATUS_UP (0)
} id 25
13:42:14.377665 29 NS_Emulation.ttcnpp:692 Message with id 25 was extracted from the queue of NSCP.
13:42:14.377681 29 NS_Emulation.ttcnpp:693 Provider Link came up: sending NS-RESET
13:42:14.377951 29 NS_Emulation.ttcnpp:581 Sent on NSCP to GbProxy_Test-NS(PCU[0])-NSVCI1-NSVCemu1-provFR(31) @NS_Types.PDU_NS : {
    pDU_NS_Reset := {
        nsPduType := '02'O,
        causeNS := {
            iEI := '00'O,
            ext := '1'B,
            lengthIndicator := {
                length1 := 1
            },
            cause := '01'O
        },
        nS_VCI := {
            iEI := '01'O,
            ext := '1'B,
            lengthIndicator := {
                length1 := 2
            },
            nS_VCI := '0001'O
        },
        nSEI_NS := {
            iEI := '04'O,
            ext := '1'B,
            lengthIndicator := {
                length1 := 2
            },
            nSEI := '07D1'O
        }
    }
}
We can see
  • FrameRelay_Emulation sends a PVC status report about DLCI 61 with active := true, new := false
    • probably in response to receiving a STATUS (full status report) from the FRNET/SGSN side in osmo-gbproxy)
  • NS_Provider_FR turns this into a NS_PROV_LINK_STATUS_UP event
    • that's wrong, this should only happen if that DLC was not up before, e.g. by checking for new == true or the like

#13 Updated by laforge 3 months ago

After some more research, it was - of course - not that simple.

The problem was actually triggered by gbproxy sending a NS-RESET before the TTCN3 NS_Provider_FR has deemedn that both the FR link are available and the DLCI/PVC are active. The existing code simply passed such messages up to NS_Emulation, which then started to establish the NS-VC despie the underlying FR not considering the link active yet.

At some later point, the Q933 code in the FR_Emulation then finally deemed the link available + pvc active, which triggered that NS_PROV_LINK_STATUS_UP, which made the NS_Emulation do a NS-RESET again, while upper layer protocols were already acitve.

In general I try to peave the TTCN3 implementations as "open" as possible, so it still
  • allows us to send packets that are "out of spec" or during the wrong state, as that may be part of a test
  • doesn't drop incoming packets, as we normally would want to catch them and setverdict(fail) on unexpected packets

However, in this specific situation, there are some cases where a peer legitimately still sends NS-TRAFFIC while the FR link or DLC is not yet considered available/alive by the peer. FR and the rudimentary nature of Q.933 are too loose to have some kind of synchronized view of that by both sides.

So I decided to drop any received FR messages until both link + DLC are deemed available/active by the local FR_Emulation. This is implemented in https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/22751

#14 Updated by laforge 3 months ago

  • Priority changed from High to Urgent
  • % Done changed from 30 to 80

Ok, I think we finally nailed it: https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-gbproxy-test-fr/114/ now has only 15 failures (3 more than the gbproxy test for IP). That's donw from 19..21 failures before.

I've configured the jenkins job to re-trigger itself every minute, and will leave that running over night. By tomorrow morning we should know if the results are now stable or not - and whihc tests still might need investigation.

#15 Updated by laforge 3 months ago

so, unfortunately the above-mentioned fixes only slightly improved the amount of errors.

What I tried now, is an endless loop that repeatedly calls GBProxy_Tests.TC_bvc_reset_sig_from_sgsn and run that (outside docker) against osmo-gbproxy in a FR configuration.

This passes most of the time, but occasionally it fails. In the error case, we can see
  • PCU->GBP: BVC-RESET + ACK for BVCI=0
  • PCU->GBP: BVC-RESET + ACK (BVCI=20011)
  • GBP->SGSN1: BVC-RESET (BVCI=20011)
  • GBP->SGSN2: BVC-RESET (BVCI=20011)
  • GBP<-SGSN2: BVC-RESET-ACK (BVCI=20011)
  • GBP<-SGSN1: BVC-STATUS (BVCI unknown) containing the BVC-RESET for BVCI 20011

So the TTCN3 test suite rejects the BVC-RESET from SGSN1, while accepting it from SGSN2 (!). The same subsequently happens for other PTP-BVCIs on SGSN1.

See gbproxy-fr-bvci_unknown.pcapng for the details.

#16 Updated by laforge 3 months ago

At the same time, the TTCN3 test suite also logs:

22:26:39.348266 544 BSSGP_Emulation.ttcnpp:520 setverdict(fail): none -> fail reason: "Rx BVC-RESET for unknown BVCI 20011", new component reason: "Rx BVC-RESET for unknown BVCI 20011" 

However, the SGSN0-BVCI20011 component is created 10s earler:

22:26:29.748379 BSSGP_Emulation.ttcnpp:209 Starting function f_bssgp_bvc_main({ bvci := 20011, cell_id := { ra_id := { lai := { mcc_mnc := '262F42'H, lac := 13135 }, rac := 0 }, cell_id := 30011 }, depth := BSSGP_DECODE_DEPTH_BSSGP (0), create_cb := refers(BSSGP_Emulation.DefaultCreateCallback) }, true, 101, "GbProxy_Test-BSSGP(SGSN[0])-BVCI20011") on component GbProxy_Test-BSSGP(SGSN[0])-BVCI20011(546).

and slightly later, the BSSGP_CT is asked for the PTP BVC component for 20011, and it answers successfully:

22:26:29.835962 BSSGP_Emulation.ttcnpp:540 Matching on port PROC succeeded: { bvci := 20011 with ? matched }
22:26:29.836050 BSSGP_Emulation.ttcnpp:540 Getcall operation on port PROC succeeded, call from mtc: @BSSGP_Emulation.BSSGP_get_bvci_ct : { bvci := 20011 } id 1
22:26:29.836103 BSSGP_Emulation.ttcnpp:540 Operation with id 1 was extracted from the queue of PROC.
22:26:29.836334 BSSGP_Emulation.ttcnpp:543 Replied on PROC to mtc @BSSGP_Emulation.BSSGP_get_bvci_ct : { } value GbProxy_Test-BSSGP(SGSN[0])-BVCI20011(546)

#17 Updated by laforge 3 months ago

this happens as the Signaling-BVC on the SGSN0 side never receives a BVC-RESET.
  • the BSSGP_CT component is started at 22:26:29.707870
  • the first BVC-RESET ever recevied is the BVC-RESET for the PTP-BVC at 22:26:39.340383
However, in the pcap file, I can see
  • GBP->SGSN0 BVC-RESET at 22:26:32.740069
  • GBP<-SGSN0 BVC-RESET at 22:26:32.748495
    but no related ack in either way.

So we actually have a situation where osmo-gbproxy is sending PTP BVC resets before the signaling BVC reset procedure has completed with a BVC-RESET-ACK in either direction.

#18 Updated by laforge 3 months ago

laforge wrote:

So we actually have a situation where osmo-gbproxy is sending PTP BVC resets before the signaling BVC reset procedure has completed with a BVC-RESET-ACK in either direction.

bssgp_bvc_fsm has a reset_notofication() call-back function. This is called when the FSM receives the RX_RESET event, which is normally triggered when a remote BVC-RESET is received. For an inbound BVC-RESET this makes sense, as we don't know when the BVC-RESET-ACK sent by us will arrive at the receiver, so we consider the BVC reset from the point a BVC-RESET was received.

In osmo-gbproxy, the reset_notification() call-back for BVCI=0 is implemented in sgsn_sig_bvc_reset_notif() and bss_sig_bvc_reset_notif(), respectively.

#19 Updated by laforge 3 months ago

  • Related to Bug #5023: race condition in NS over IP sub-network added

#20 Updated by laforge 3 months ago

  • % Done changed from 80 to 90

#21 Updated by laforge 3 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

two consecutive builds with only 6 failures (same count as the non-fr gbproxy tests). Looks like we finally resolved this one.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)