Project

General

Profile

Actions

Bug #5774

open

Drop pcu_sock conn if queue length grows up to a given threshold

Added by pespin over 1 year ago. Updated 4 months ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
11/18/2022
Due date:
% Done:

90%

Spec Reference:

Description

<pespin_> we seem to have some memleak or mem-kept problem with osmo-bts-trx after a while when the pcu is paused in gdb
<pespin_> my osmo-bts-trx was taking 35% of my 16GB RAM
<LaF0rge> pespin_: well, probably lots of primitives queued for the pcu socket ;)
<pespin_> probably osmo-bts-trx sees the PCU socket still alive (process paused by gdb) and keeps queueing stuff forever in the PCUIF queue
<pespin_> yeah
<fixeria> maybe it's queueing the PCUIF messages somewhere?
<pespin_> we should block the queue from growing like that
<pespin_> keep it at a meaningful maximum size and drop older messages when adding new ones if the threshold is reached
<pespin_> I'll create a ticket to remind us
<fixeria> libosmocore's write_queue allows to set the limit, AFAIR
<pespin_> yeah but it's probably not dropping the old ones
<Hoernchen_> just drop all, no point trying to fix it by keeping "recent" messages, since the whole relationship of messages is fucked up anyway
<fixeria> but it will be rejecting new messages, not old
<pespin_> Hoernchen_, still it may be able to recover that way
<Hoernchen_> how?
<pespin_> just by keeping processing new messages?
<Hoernchen_> yeah,new ones, but why bother with the ones in the queue?
<pespin_> that's why I said "up to a meaningul length" 
<Hoernchen_> it just adds complexity keping track without actually affecting the probability of the "catching up" part after missing a lot of messages
<pespin_> no need to keep 1000 of them
<fixeria> the messages are queued because the fd is not writeable, so by processing out-of-queue you'll simply block the process
<pespin_> not much complexity: if (len > 10), llist_del(list_head->next); llist_add_tail(new_msg)
<LaF0rge> i would probably simply close the pcu_socket if we detect long queue lengths
<LaF0rge> let the pcu reconnect if it recovers/restarts
<LaF0rge> so have a osmo_wqueue, and if it overflows, close the fd, done.
<LaF0rge> [and flush the queue of course when closing]
<pespin_> ack makes sense

So let's establish a good threshold (configurable to allow debugging pcu without breaking the conn?) at which osmo-bts-trx can decide the PCU is stuck and can close the conn and drop the messages until it can reconnect again successfuly.


Files

osmo-bsc_fixed.cfg osmo-bsc_fixed.cfg 3.17 KB arehbein, 02/21/2023 04:10 PM
osmo-bts-trx.cfg osmo-bts-trx.cfg 1.29 KB arehbein, 02/21/2023 04:10 PM
osmo-pcu.cfg osmo-pcu.cfg 490 Bytes arehbein, 02/21/2023 04:10 PM
Actions #1

Updated by laforge about 1 year ago

  • Assignee changed from 4368 to arehbein
Actions #2

Updated by arehbein about 1 year ago

  • Status changed from New to In Progress

Hello pespin,

two points I'd like to have some feedback on:

1. Which component/config combinations do I need to start in order to reproduce this scenario? (Any commands to trigger the queue buildup?)

Here are the commands I ran and some output:

osmocom-bb/src/target/trx_toolkit$ ./fake_trx.py -R 127.0.0.1

osmo-bsc$ ./src/osmo-bsc/osmo-bsc -c doc/examples/osmo-bsc/osmo-bsc-4trx.cfg

<0022> ../../src/m3ua.c:508 XUA_AS(as-clnt-A-0-m3ua)[0x612000010d20]{AS_DOWN}: Event AS-TRANSFER.req not permitted

osmo-pcu$ ./src/osmo-pcu -c ./doc/examples/osmo-pcu.cfg

<0001> osmobts_sock.c:238 osmo-bts PCU socket /tmp/pcu_bts has been connected
<0001> pcu_l1_if.cpp:1119 Received message for new BTS0
<0001> pcu_l1_if.cpp:732 BTS not available

osmo-bts$ ./src/osmo-bts-trx/osmo-bts-trx -c ./doc/examples/trx/osmo-bts-trx.cfg

<0012> ../../src/input/ipaccess.c:1098 enabling ipaccess BTS mode, OML connecting to 192.168.122.1:3002
<000b> trx_if.c:1266 phy0.0: Opening TRXC/TRXD connections
<0009> pcu_sock.c:1191 PCU socket connected to external PCU
<0001> oml.c:90 OC=GPRS-CELL INST=(00,ff,ff): Sending PCU version report to BSC: 1.2.0.1-d8ce
<0001> oml.c:90 OC=GPRS-CELL INST=(00,ff,ff): Sending PCU version report to BSC: 1.2.0.1-d8ce
<0001> oml.c:90 OC=GPRS-CELL INST=(00,ff,ff): Sending PCU version report to BSC: 1.2.0.1-d8ce
...

Seems like the BTS doesn't advertise itself as available for whatever reason. I'm not sure what's going on/if that means anything for this testcase etc., would appreciate some help with setting up the case referred to in the issue description, plus the approximate amount of time (i.e. magnitude) it took to build up that RAM usage.

2. Current approach

I thought I'd ask before coding away in this case... especially since I haven't yet been able to reproduce the error and am currently going on what I've read in the code/here.

This is what I'm thinking of doing (more of a rough sketch with possible errors than a solution):

  • make member upqueue of struct pcu_sock_state a struct osmo_wqueue, set the maximum size of the queue/make it configurable
  • set the read/write callbacks for the wqueue (the logic for that probably already exists and I'll just have to put it in a callback function... it'll probably be a function quite similar to pcu_sock_send())
  • adapt queueing/dequeueing in the code to reflect that change in data type and by adapting the fd setup in the PCU socket code of the BTS
  • if enqueuing fails with -ENOSPC, close the PCU socket

Please tell me if that is the right kind of approach or if I should be doing something else.

Actions #3

Updated by arehbein about 1 year ago

Additional note:

The messages

oml.c:90 OC=GPRS-CELL INST=(00,ff,ff): Sending PCU version report to BSC: 1.2.0.1-d8ce

might also be a/the cause of memory building up. So in theory some changes could be made concerning that. From the short look I've taken so far, this might however not be that simple, since it boils down to abis_sendmsg(msg) further up the callstack.

Actions #4

Updated by arehbein about 1 year ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 20
Actions #5

Updated by pespin about 1 year ago

1. You probably don't have gprs enabled in osmo-bsc.cfg for that BTS? Can you share your config files?
2. Fine, except that in case the queue is full, then drop the first item in the list before adding the new one at the end. This way when the other side resumes working we still continue working (useful for debugging for instance).

Actions #6

Updated by arehbein about 1 year ago

  • File osmo-bts-trx.cfg added
  • File osmo-pcu.cfg added
  • File osmo-bsc-4trx.cfg added

Hi pespin,

these are the configs. Took a short look and yes, it says gprs mode none.

Can you refer me to a working example of the components needed for this test configured with gprs enabled (and then what to do for the allocations to build up?)

Thank you

Actions #7

Updated by pespin about 1 year ago

Hi @arehbein ,

Have a look at OsmoBSC's User Manual (https://ftp.osmocom.org/docs/latest/osmobsc-usermanual.pdf) "9.5 Configuring GPRS PCU parameters of a BTS". Make sure you also set up some "phys_chan_config" as "PDCH" (eg. TS7), this will ensure scheduler messages are sent to the PCU.

As on how to reproduce, it's explained in the first line of the ticket:

<pespin_> we seem to have some memleak or mem-kept problem with osmo-bts-trx after a while when the pcu is paused in gdb

So run the whole setup, with osmo-pcu under gdb, then CTRL+C to pause the process, and wait a bit (a few mins), osmo-bts-trx should keep growing its memory usage. You should be able to tell by getting a talloc report of osmo-bts-trx by using VTY or sending a -SIGUSR1 (or SIGUSR2?) signal to osmo-bts-trx (you should see an ever increasing number of messages being allocated and queued).

Actions #8

Updated by arehbein about 1 year ago

Actions #9

Updated by arehbein about 1 year ago

  • % Done changed from 40 to 80

I have just uploaded two patches (one with the fix, the other with the config stuff separate):

Main fix:
https://gerrit.osmocom.org/c/osmo-bts/+/31533

Config/vty:
https://gerrit.osmocom.org/c/osmo-bts/+/31534

The memleak on PCU pause via gdb was gone, didn't observe any other memory building up either on open-bts-trx in the described case.

Actions #10

Updated by arehbein about 1 year ago

  • % Done changed from 80 to 90
Actions #11

Updated by arehbein 9 months ago

Currently waiting for some help with running osmo-bsc connected to a dummy/osmo-pcu via PCU IF exchanging info ind. messages (see comment here: https://gerrit.osmocom.org/c/osmo-bts/+/31533/comments/0304a545_a5695652), so I can try out the analogous change for osmo-bsc.

Actions #12

Updated by pespin 7 months ago

  • Assignee changed from arehbein to dexter

dexter can you probably give a try to this to see if the queueing is working fine? I believe you are the only one with an RBS setup right now? The patch is merged, so it's probablyu only about a quick test with osmo-bsc master.

Actions #13

Updated by arehbein 7 months ago

pespin wrote in #note-12:

dexter can you probably give a try to this to see if the queueing is working fine? I believe you are the only one with an RBS setup right now? The patch is merged, so it's probablyu only about a quick test with osmo-bsc master.

It's not merged, the changes for osmo-bsc are https://gerrit.osmocom.org/c/osmo-bsc/+/33891 and https://gerrit.osmocom.org/c/osmo-bsc/+/33892 . Dexter gave me some pointers for trying out the bug (I wrote him a mail before).

However, I still don't have a VPN certificate yet for remote work and since I have a slight cold, people in the office will probably not want me around.

Actions #14

Updated by dexter 7 months ago

(I will have a look at this now.)

Actions #15

Updated by dexter 7 months ago

I have now carefully tested both patches. There were no problems in general. I also can confirm that OsmoBSC closes the PCU socket after a while when OsmoPCU is paused.

Actions #16

Updated by dexter 7 months ago

  • Assignee changed from dexter to arehbein
Actions #17

Updated by arehbein 6 months ago

  • Status changed from In Progress to Feedback
Actions #18

Updated by laforge 4 months ago

  • Assignee changed from arehbein to jolly
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)