Project

General

Profile

Actions

Bug #6375

closed

default TCP user timeout is way too low

Added by laforge about 2 months ago. Updated 10 days ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
02/24/2024
Due date:
% Done:

100%

Spec Reference:

Description

The default USERT_IMEOUT for TCP keep-alive (which is active by default) should have been:

val = 1000 * line->keepalive_num_probes * line->keepalive_probe_interval + line->keepalive_idle_timeout;

Which should have been 1000 * 10 * 3 + 30 which expands to roughly 30 seconds.

However, I guess there's two bugs in the code, looking at it:

  1. there should have been parenthesis around the + operator (line->keepalive_probe_interval + line->keepalive_idle_timeout) as the keepalive_idle_timeout is in seconds, not milli-seconds.
  2. in the default case, all those values are configured to -1 (E1INP_USE_DEFAULT). This means we're using 1000 * -1 * -1 + -1 = 999, i.e. just below aq second which clearly is not enough for a lossy satellite back-haul.

This is actaully rather critical, a proposed untested fix is in https://gerrit.osmocom.org/c/libosmo-abis/+/36079


Files

testsuite-netem-1200ms.pcap testsuite-netem-1200ms.pcap 200 KB fixeria, 03/17/2024 08:30 PM
osmo-bts-netem-1200ms.pcap osmo-bts-netem-1200ms.pcap 483 KB fixeria, 03/17/2024 08:30 PM
osmo-bsc-netem-1200ms.pcap osmo-bsc-netem-1200ms.pcap 289 KB fixeria, 03/17/2024 08:58 PM

Checklist

  • add log output for the timeout values when they are applied
  • back-port + patch releases for 1.4.x and 1.5.x
  • add a news entry

Related issues

Related to OsmoBSC - Feature #4541: add ttcn3 tests for situations of Abis latency spikes, where chan act/deact ACK messages arrive lateNew05/09/2020

Actions
Actions #1

Updated by laforge about 2 months ago

  • Checklist item add log output for the timeout values when they are applied added
  • Checklist item automatic testing with non-zero packet-loss/latency/jitter added

what we learn from this:

  • our automatic testing is insufficient. We are not ever testing over realistic non-terrestrial back-haul latencies. We should try running at least some of our BTS / BSC tests over an Abis link with simulated latency and non-zero packet loss
  • it's not a good idea to make a feature like this the default without actually even manually testing once if the timeout actually expires after the expected 30s, instead of the current < 1s
  • intermediate computed values should potentially be printed/logged
Actions #2

Updated by laforge about 2 months ago

  • Checklist item back-port + patch releases for 1.4.x and 1.5.x added
  • Assignee changed from laforge to osmith
Actions #3

Updated by osmith about 2 months ago

  • % Done changed from 0 to 20

automatic testing with non-zero packet-loss/latency/jitter

I'm looking into reproducing the bug by running the ttcn3 testsuite with netem, and adding a separate jenkins job that runs the testsuite with netem. Then we can see if there are additional failures, and compare with the regular run of the testsuite.

Actions #4

Updated by osmith about 2 months ago

  • Related to Feature #4541: add ttcn3 tests for situations of Abis latency spikes, where chan act/deact ACK messages arrive late added
Actions #5

Updated by laforge about 2 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 20 to 100
Actions #6

Updated by laforge about 2 months ago

  • Checklist item add log output for the timeout values when they are applied set to Done
  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 30
Actions #7

Updated by osmith about 2 months ago

  • % Done changed from 30 to 40

I'm having a hard time getting e.g. osmo-bts into a state where it loses the abis link.

What I have tried:
  • use netem to just slow down a specific test, which should in theory result in losing the TCP connection if the delay is >= 1s and the fix is not applied to libosmo-abis
  • picked TC_est_dchan from the BTS testsuite, which has some abis traffic
  • however this resulted in hitting unrelated timeouts in the testsuite, so I have tried another approach: writing a separate ttcn3 test that adds a delay in the TCP connection to get into the error state. As I looked closer into this, I realized that it is probably quite some effort to do this at the TCP level with our TTCN-3 testsuite. So I went back to the netem approach.
  • I have adjusted a timeout in the testsuite that caused the unrelated failure, and then the test took significantly longer and was able to pass with an artificial delay of 900 ms, 1000ms, 1500ms.
  • Unfortunately the test also passes with the delay if libosmo-abis is not patched...

So looking further into reproducing it. My plan is to make the patch releases after verifying that the default timeout now works as intended.

EDIT: tc command I'm using: "tc qdisc add dev eth0 root netem delay 1500ms"

Actions #8

Updated by osmith about 1 month ago

  • Assignee changed from osmith to fixeria

Assigning the ticket to Vadim, as discussed.

Actions #9

Updated by laforge 30 days ago

can we please get the back-ports to 1.4.x and 1.5.x and related news items out of the door soon-ish?

Actions #10

Updated by fixeria 29 days ago

  • % Done changed from 40 to 50

laforge wrote in #note-9:

can we please get the back-ports to 1.4.x and 1.5.x and related news items out of the door soon-ish?

Below are the related patches:

v1.4.2

https://gerrit.osmocom.org/c/libosmo-abis/+/36286 [cosmetic] input/ipaccess.c: We use tab indent, not spaces [wtf] [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36277 input/ipaccess: Make sure to say "TCP keepalive" [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36278 Fix critical bug in default TCP keepalive user timeout [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36279 Bump version: 1.4.1.3-b8a4 → 1.4.2 [NEW]

v1.5.2

https://gerrit.osmocom.org/c/libosmo-abis/+/36280 [cosmetic] input/ipaccess.c: We use tab indent, not spaces [wtf] [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36281 input/ipaccess: Make sure to say "TCP keepalive" [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36282 Fix critical bug in default TCP keepalive user timeout [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36283 Bump version: 1.5.1.3-0eed → 1.5.2 [NEW]

Actions #11

Updated by fixeria 28 days ago

  • Checklist item back-port + patch releases for 1.4.x and 1.5.x set to Done

All patches have been merged, patch releases tagged.

You mentioned the news items, so here is a template for review:

A critical bug has been discovered in libosmo-abis: the calculation of the TCP keepalive user timeout value was incorrect. This issue remained unnoticed for quite some time, ever since the initial support for TCP_USER_TIMEOUT was merged back in Jun 2019 (libosmo-abis >= 0.7.0). However, since it was not yet enabled by default at that time, only users with explicit TCP keepalive configuration in their config files would be affected. The TCP keepalive was later enabled by default in Nov 2022 (libosmo-abis >= 1.4.0), impacting all users and resulting in a significantly shorter TCP_USER_TIMEOUT value than expected.

The problem has been fixed in master. Patch releases tagged for two latest minor versions:

Release Tag
1.4.2 https://gitea.osmocom.org/osmocom/libosmo-abis/releases/tag/1.4.2
1.5.2 https://gitea.osmocom.org/osmocom/libosmo-abis/releases/tag/1.5.2

This issue primarily impacted production users operating IPA connections (A-bis/OML, A-bis/RSL, GSUP) over links with real-life delays, while it is unlikely to have affected the lab users.

laforge is this looking good to you?

Actions #12

Updated by laforge 28 days ago

Please rephrase below:

On Sat, Mar 16, 2024 at 06:06:02AM +0000, fixeria wrote:

This issue primarily impacted production users operating IPA connections (A-bis/OML, A-bis/RSL, GSUP) over links with real-life delays, while it is unlikely to have affected the lab users.

over "links with delays with RTT approaching one second, such as overloaded/flakey wifi links or VSAT
back-haul. It is unlikely to have affected users with non-overloaded terrestrial/wired back-haul".

Apart from that: all good!

Actions #13

Updated by fixeria 28 days ago

  • Checklist item add a news entry added
Actions #14

Updated by fixeria 27 days ago

osmith wrote in #note-7:

I'm having a hard time getting e.g. osmo-bts into a state where it loses the abis link.

Likewise, I did some experiments and yet was unable to make TTCN-3 test cases fail by introducing artificial delays.

What I have tried:
  • use netem to just slow down a specific test, which should in theory result in losing the TCP connection if the delay is >= 1s and the fix is not applied to libosmo-abis
  • picked TC_est_dchan from the BTS testsuite, which has some abis traffic
  • however this resulted in hitting unrelated timeouts in the testsuite, so I have tried another approach: writing a separate ttcn3 test that adds a delay in the TCP connection to get into the error state. As I looked closer into this, I realized that it is probably quite some effort to do this at the TCP level with our TTCN-3 testsuite. So I went back to the netem approach.
  • I have adjusted a timeout in the testsuite that caused the unrelated failure, and then the test took significantly longer and was able to pass with an artificial delay of 900 ms, 1000ms, 1500ms.
  • Unfortunately the test also passes with the delay if libosmo-abis is not patched...

So looking further into reproducing it. My plan is to make the patch releases after verifying that the default timeout now works as intended.

EDIT: tc command I'm using: "tc qdisc add dev eth0 root netem delay 1500ms"

Below is what I did:

  • modify the the ttcn3-bts-test image to include the iproute2 package, which provides the tc-netem
  • build the osmo-bts-master image with an old/unpatched version of libosmo-abis.git (v1.5.1)
  • enable the TCP keepalive, which is disabled by default in osmo-bts.cfg
diff --git a/ttcn3-bts-test/generic/osmo-bts.confmerge b/ttcn3-bts-test/generic/osmo-bts.confmerge
index 9015904..17bd5ab 100644
--- a/ttcn3-bts-test/generic/osmo-bts.confmerge
+++ b/ttcn3-bts-test/generic/osmo-bts.confmerge
@@ -1,5 +1,7 @@
 line vty
  bind 0.0.0.0
+e1_input
+ e1_line 0 keepalive
 phy 0
  osmotrx ip local 172.18.9.20
  osmotrx ip remote 172.18.9.21
  • patch ttcn3-bts-test/jenkins.sh
    • to use custom images (note the :netem tag)
    • to execute ttcn3-bts-test with --cap-add=NET_ADMIN (needed for tc-netem)
    • to configure an artificial delay before starting ttcn3-bts-test
diff --git a/ttcn3-bts-test/jenkins.sh b/ttcn3-bts-test/jenkins.sh
index f932b4a..e374044 100755
--- a/ttcn3-bts-test/jenkins.sh
+++ b/ttcn3-bts-test/jenkins.sh
@@ -50,7 +50,7 @@ start_bts() {
                        -e "SLEEP_BEFORE_RESPAWN=$sleep_time_respawn" \
                        --name ${BUILD_TAG}-bts -d \
                        $DOCKER_ARGS \
-                       $REPO_USER/osmo-bts-$IMAGE_SUFFIX \
+                       $REPO_USER/osmo-bts-$IMAGE_SUFFIX:netem \
                        /bin/sh -c "/usr/local/bin/respawn.sh osmo-bts-$variant -c /data/osmo-bts.gen.cfg >>/data/osmo-bts.log 2>&1" 
 }

@@ -117,8 +117,10 @@ start_testsuite() {
                        -v $VOL_BASE_DIR/bts-tester-${variant}:/data \
                        -v $VOL_BASE_DIR/unix:/data/unix \
                        --name ${BUILD_TAG}-ttcn3-bts-test \
+                       --cap-add=NET_ADMIN \
                        $DOCKER_ARGS \
-                       $REPO_USER/ttcn3-bts-test
+                       $REPO_USER/ttcn3-bts-test:netem \
+                       /bin/sh -c "tc qdisc add dev eth0 root netem delay 1000ms;  ttcn3-docker-run bts BTS_Tests;" 
 }

 set_pcuif_version() {

With that, I am able to execute BTS_Tests.TC_est_dchan and it's passing reliably (without any additional changes) with 500ms, 1000ms, 1500ms.
I confirmed the delay is applied correctly by running ping. The testcase execution time also increases significantly.

As a side effect, for some unknown reason, the testsuite fails to start tcpdump with those modifications:

fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ ls -la /tmp/tmp.IjrR8l78tv/bts-tester-generic/
total 7732
drwxr-xr-x  3 fixeria fixeria    4096 Mar 17 08:04 .
drwxr-xr-x 11 fixeria fixeria    4096 Mar 17 08:04 ..
-rw-r--r--  1 fixeria fixeria     751 Mar 17 08:04 BTS_Tests.cfg
-rw-r--r--  1 root    root    7882935 Mar 17 08:04 BTS_Tests.TC_est_dchan.merged
-rw-r--r--  1 root    root          0 Mar 17 08:02 BTS_Tests.TC_est_dchan.netcat.stderr
-rw-r--r--  1 root    root        151 Mar 17 08:02 BTS_Tests.TC_est_dchan.pcap.stderr
-rw-r--r--  1 root    root        151 Mar 17 08:02 BTS_Tests.TC_est_dchan.pcap.stdout
-rw-r--r--  1 root    root          0 Mar 17 08:04 BTS_Tests.TC_est_dchan.talloc
-rw-r--r--  1 root    root        204 Mar 17 08:04 junit-xml-22.log
drwxr-xr-x  2 root    root       4096 Mar 17 08:02 unix
fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ cat /tmp/tmp.IjrR8l78tv/bts-tester-generic/BTS_Tests.TC_est_dchan.pcap.stderr
Capturing on 'any'
dumpcap: The file to which the capture would be saved ("/data/BTS_Tests.TC_est_dchan.pcap") could not be opened: Permission denied.
fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ cat /tmp/tmp.IjrR8l78tv/bts-tester-generic/BTS_Tests.TC_est_dchan.pcap.stdout 
Capturing on 'any'
dumpcap: The file to which the capture would be saved ("/data/BTS_Tests.TC_est_dchan.pcap") could not be opened: Permission denied.
Actions #15

Updated by fixeria 27 days ago

fixeria wrote in #note-14:

With that, I am able to execute BTS_Tests.TC_est_dchan and it's passing reliably (without any additional changes) with 500ms, 1000ms, 1500ms.

I did some experiments, running osmo-bsc and osmo-bts with TCP keepalive locally (not in Docker) and I was able to reproduce the problem: osmo-bts does terminate the A-bis link.

Actions #16

Updated by fixeria 26 days ago

I was able to record a few PCAPs (by manually running tcpdump in containers), attaching them for further analysis.
One is recorded on the IUTs side, another is on the testsuite's side.

Actions #17

Updated by fixeria 26 days ago

fixeria wrote in #note-15:

I did some experiments, running osmo-bsc and osmo-bts with TCP keepalive locally (not in Docker) and I was able to reproduce the problem: osmo-bts does terminate the A-bis link.

Ok, I was also able to reproduce the problem in Docker by introducing artificial delay on the A-bis/OML (not RSL) link. This was a bit tricky, because having the delay during the OML bootstrapping would slow down the RSL connection establishment a lot and cause the guard timer to fire, so this had to be done manually in the right time. Attaching another PCAP (recorded on the BSC's side) for analysis.

I think it does not affect the RSL link because, compared to the OML link, it's not idling for enough time. At least the RF RESource INDication is being sent every so often for each TRX instance. The OML link, on the other hand, remains idle most of the time after the bootstrapping is completed.

Actions #19

Updated by fixeria 25 days ago

Yesterday I gave ttcn3-bts-test a spin with simulated delay of 600ms on the A-bis/RSL interface.

fixeria@fixeria-dev:/tmp/tmp.VKcOSQ8db6/bts-tester-generic$ ls -lah | grep TC_chan_act_stress
-rw-r--r--  1 root    root     870K Mar 18 21:20 BTS_Tests.TC_chan_act_stress.merged
-rw-r--r--  1 root    root        0 Mar 18 19:27 BTS_Tests.TC_chan_act_stress.netcat.stderr

Judging by this output, executing generic profile of the testsuite took ~2h. Executing the hopping profile would be +2h, so ~4h in total.
According to Jenkins, right now it takes ~2h to execute both profiles, so adding a delay of 600ms would roughly double the testsuite execution time.

As was agreed during the Jitsi call today, I will experiment further with lower delay values (<100ms) and other testsuites.

Actions #20

Updated by fixeria 23 days ago

  • Checklist item deleted (automatic testing with non-zero packet-loss/latency/jitter)
  • Status changed from In Progress to Resolved
  • % Done changed from 60 to 100

I created a separate ticket #6411 for the TTCN-3 testing, this ticket can be closed.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)