Project

General

Profile

Actions

Bug #4378

closed

Having 'sctp-role client' in configuration causes high CPU load

Added by fixeria almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
01/28/2020
Due date:
% Done:

100%

Spec Reference:

Description

How to reproduce?

Just add 'sctp-role client' to your configuration file, e.g. to osmo-msc.cfg:

cs7 instance 0
  point-code 0.23.1
  asp asp-clnt-OsmoMSC-A 2905 0 m3ua
    remote-ip 127.0.0.1
    sctp-role client ! <-- this line causes OsmoMSC to use 100% CPU
  as as-clnt-OsmoMSC-A m3ua
    asp asp-clnt-OsmoMSC-A
    routing-key 1 0.23.1

What happens?

Looking at the output of strace, it seems like some file descriptor makes the select() loop non-blocking:

select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519755}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519751})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519456}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519450})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519371}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519367})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519280}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519277})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519191}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519187})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519112}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519109})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519038}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519035})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518958}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518955})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518877}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518872})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518813}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518810})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518747}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518744})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518686}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518683})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518632}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518619})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518553}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518550})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518501}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518498})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518423}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518420})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518353}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518350})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518282}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518278})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518132}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518128})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518048}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518045})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517967}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517963})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517888}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517885})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517809}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517806})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517730}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517727})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517651}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517647})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517576}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517573})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517519}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517517})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517465}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517461})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517395}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517392})
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517317}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517314})

lsof tells that fd=4 is an SCTP connection:

$ lsof -p 387341 -ad 4
COMMAND     PID    USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
osmo-msc 387341 fixeria    4u  sock    0,9      0t0 459782671 protocol: SCTP

OsmoBSC fails to establish connection to the MSC:

Jan 28 16:45:50 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message...
Jan 28 16:45:50 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP
Jan 28 16:45:50 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route!
Jan 28 16:45:52 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message...
Jan 28 16:45:52 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP
Jan 28 16:45:52 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route!
Jan 28 16:45:54 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message...
Jan 28 16:45:54 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP
Jan 28 16:45:54 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route!
Jan 28 16:45:56 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message...
Jan 28 16:45:56 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP
Jan 28 16:45:56 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route!

OsmoSTP has the following routing table:

OsmoSTP# show cs7 instance 0 route 
Routing table = system
C=Cong Q=QoS P=Prio

Destination            C Q P Linkset Name        Linkset Non-adj Route
---------------------- - - - ------------------- ------- ------- -------
0.23.3/14                  0 as-rkm-2            ?       ?       ?      

Some more observations

I don't know the exact role OsmoMSC is supposed to play. Adding this line to osmo-bsc.cfg replicates the same behaviour. Changing sctp-role from 'client' to 'server' or omitting this line makes both OsmoMSC and OsmoBSC talk to each other just fine again.

I had this line in my osmo-msc.cfg since the last breakage of libosmo-sccp, and so far everything was working good (until the recent upgrade). Even if this is an incorrect configuration, we should not keep OsmoMSC running and eating CPU. Aborting the process makes more sense in this case. Otherwise it's not easy to notice this problem.

pespin any ideas?


Files

osmo-stp.log osmo-stp.log 7.75 KB fixeria, 01/28/2020 10:31 AM
osmo-msc.log osmo-msc.log 13.1 KB fixeria, 01/28/2020 10:31 AM
osmo-msc.cfg osmo-msc.cfg 1.17 KB fixeria, 01/28/2020 10:33 AM
osmo-stp.cfg osmo-stp.cfg 392 Bytes fixeria, 01/28/2020 10:33 AM
Actions #1

Updated by fixeria almost 2 years ago

Attaching log files for OsmoSTP and OsmoMSC with 'logging level set-all debug'.

Actions #2

Updated by fixeria almost 2 years ago

Also attaching the configuration files.

Actions #3

Updated by fixeria almost 2 years ago

More details from our cooperative investigation:

(gdb) p ufd->cb                                                                                                                                                          
$4 = (int (*)(struct osmo_fd *, unsigned int)) 0x7ffff7ed50f0 <osmo_stream_cli_fd_cb>

(gdb) p (struct osmo_stream_cli *) ofd->data                                                                                                                             
$5 = (struct osmo_stream_cli *) 0x5555557670d0                                                                                                                           
(gdb) p *(struct osmo_stream_cli *) ofd->data                                                                                                                            
$6 = {ofd = {list = {next = 0x555555676fe0, prev = 0x5555557aef88}, fd = 5, when = 3, cb = 0x7ffff7ed50f0 <osmo_stream_cli_fd_cb>, data = 0x5555557670d0, priv_nr = 0},  
  tx_queue = {next = 0x555555767100, prev = 0x555555767100}, timer = {node = {rb_parent_color = 0, rb_right = 0x0, rb_left = 0x0}, list = {next = 0x0, prev = 0x0},      
    timeout = {tv_sec = 0, tv_usec = 0}, active = 0, cb = 0x7ffff7ed5d20 <cli_timer_cb>, data = 0x5555557670d0}, state = STREAM_CLI_STATE_NONE, addr = {                 
    0x555555766ec0 "127.0.0.1", 0x0 <repeats 31 times>}, addrcnt = 1 '\001', port = 2905, local_addr = {0x0 <repeats 32 times>}, local_addrcnt = 1 '\001',               
  local_port = 0, proto = 132, connect_cb = 0x7ffff7ef4b20 <xua_cli_connect_cb>, disconnect_cb = 0x0, read_cb = 0x7ffff7ef8cb0 <xua_cli_read_cb>, write_cb = 0x0,        
  data = 0x555555766b90, flags = 2, reconnect_timeout = 5}

The state = STREAM_CLI_STATE_NONE causes the switch statement in osmo_stream_cli_fd_cb() to do nothing, so we merely return. That's the reason why select() always gives us the control and does not block. Now the main question is how do we end up in that state.

Actions #4

Updated by pespin almost 2 years ago

  • Status changed from New to Feedback
  • Assignee set to pespin
  • % Done changed from 0 to 90

Fixed by:
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17029 stream: Drop data during write() while in state NONE
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17030 stream: Re-arrange cli states to fix 100% cpu usage bu

Extra:
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17031 stream: Add new WAIT_RECONNECT cli state
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17032 stream: Rename cli state NONE to CLOSED

Actions #5

Updated by fixeria almost 2 years ago

  • Status changed from Feedback to Resolved
  • Assignee deleted (pespin)
  • % Done changed from 90 to 100

The patches have been merged, both OsmoMSC and OsmoBSC now work like a charm (in both client and server modes).

Thank you, Pau!

Actions #6

Updated by fixeria almost 2 years ago

  • Assignee set to pespin
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)