Bug #4378
closedHaving 'sctp-role client' in configuration causes high CPU load
100%
Description
How to reproduce?¶
Just add 'sctp-role client' to your configuration file, e.g. to osmo-msc.cfg:
cs7 instance 0 point-code 0.23.1 asp asp-clnt-OsmoMSC-A 2905 0 m3ua remote-ip 127.0.0.1 sctp-role client ! <-- this line causes OsmoMSC to use 100% CPU as as-clnt-OsmoMSC-A m3ua asp asp-clnt-OsmoMSC-A routing-key 1 0.23.1
What happens?¶
Looking at the output of strace, it seems like some file descriptor makes the select() loop non-blocking:
select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519755}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519751}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519456}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519450}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519371}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519367}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519280}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519277}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519191}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519187}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519112}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519109}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=519038}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=519035}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518958}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518955}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518877}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518872}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518813}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518810}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518747}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518744}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518686}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518683}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518632}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518619}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518553}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518550}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518501}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518498}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518423}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518420}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518353}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518350}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518282}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518278}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518132}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518128}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=518048}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=518045}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517967}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517963}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517888}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517885}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517809}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517806}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517730}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517727}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517651}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517647}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517576}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517573}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517519}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517517}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517465}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517461}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517395}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517392}) select(11, [3 4 5 6 9 10], [4 10], [], {tv_sec=0, tv_usec=517317}) = 2 (in [4], out [4], left {tv_sec=0, tv_usec=517314})
lsof tells that fd=4 is an SCTP connection:
$ lsof -p 387341 -ad 4 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME osmo-msc 387341 fixeria 4u sock 0,9 0t0 459782671 protocol: SCTP
OsmoBSC fails to establish connection to the MSC:
Jan 28 16:45:50 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message... Jan 28 16:45:50 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP Jan 28 16:45:50 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route! Jan 28 16:45:52 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message... Jan 28 16:45:52 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP Jan 28 16:45:52 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route! Jan 28 16:45:54 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message... Jan 28 16:45:54 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP Jan 28 16:45:54 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route! Jan 28 16:45:56 DELL osmo-bsc[389147]: DMSC NOTICE a_reset.c:106 A-RESET(msc-0)[0x55b321ad6a70]{DISC}: (re)sending BSSMAP RESET message... Jan 28 16:45:56 DELL osmo-bsc[389147]: DMSC NOTICE osmo_bsc_sigtran.c:102 Sending RESET to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP Jan 28 16:45:56 DELL osmo-stp[389002]: DLSS7 ERROR osmo_ss7_hmrt.c:257 MTP-TRANSFER.req for DPC 185: no route!
OsmoSTP has the following routing table:
OsmoSTP# show cs7 instance 0 route Routing table = system C=Cong Q=QoS P=Prio Destination C Q P Linkset Name Linkset Non-adj Route ---------------------- - - - ------------------- ------- ------- ------- 0.23.3/14 0 as-rkm-2 ? ? ?
Some more observations¶
I don't know the exact role OsmoMSC is supposed to play. Adding this line to osmo-bsc.cfg replicates the same behaviour. Changing sctp-role from 'client' to 'server' or omitting this line makes both OsmoMSC and OsmoBSC talk to each other just fine again.
I had this line in my osmo-msc.cfg since the last breakage of libosmo-sccp, and so far everything was working good (until the recent upgrade). Even if this is an incorrect configuration, we should not keep OsmoMSC running and eating CPU. Aborting the process makes more sense in this case. Otherwise it's not easy to notice this problem.
pespin any ideas?
Files
Updated by fixeria about 4 years ago
- File osmo-msc.log osmo-msc.log added
- File osmo-stp.log osmo-stp.log added
Attaching log files for OsmoSTP and OsmoMSC with 'logging level set-all debug'.
Updated by fixeria about 4 years ago
- File osmo-msc.cfg osmo-msc.cfg added
- File osmo-stp.cfg osmo-stp.cfg added
Also attaching the configuration files.
Updated by fixeria about 4 years ago
More details from our cooperative investigation:
(gdb) p ufd->cb $4 = (int (*)(struct osmo_fd *, unsigned int)) 0x7ffff7ed50f0 <osmo_stream_cli_fd_cb> (gdb) p (struct osmo_stream_cli *) ofd->data $5 = (struct osmo_stream_cli *) 0x5555557670d0 (gdb) p *(struct osmo_stream_cli *) ofd->data $6 = {ofd = {list = {next = 0x555555676fe0, prev = 0x5555557aef88}, fd = 5, when = 3, cb = 0x7ffff7ed50f0 <osmo_stream_cli_fd_cb>, data = 0x5555557670d0, priv_nr = 0}, tx_queue = {next = 0x555555767100, prev = 0x555555767100}, timer = {node = {rb_parent_color = 0, rb_right = 0x0, rb_left = 0x0}, list = {next = 0x0, prev = 0x0}, timeout = {tv_sec = 0, tv_usec = 0}, active = 0, cb = 0x7ffff7ed5d20 <cli_timer_cb>, data = 0x5555557670d0}, state = STREAM_CLI_STATE_NONE, addr = { 0x555555766ec0 "127.0.0.1", 0x0 <repeats 31 times>}, addrcnt = 1 '\001', port = 2905, local_addr = {0x0 <repeats 32 times>}, local_addrcnt = 1 '\001', local_port = 0, proto = 132, connect_cb = 0x7ffff7ef4b20 <xua_cli_connect_cb>, disconnect_cb = 0x0, read_cb = 0x7ffff7ef8cb0 <xua_cli_read_cb>, write_cb = 0x0, data = 0x555555766b90, flags = 2, reconnect_timeout = 5}
The state = STREAM_CLI_STATE_NONE causes the switch statement in osmo_stream_cli_fd_cb() to do nothing, so we merely return. That's the reason why select() always gives us the control and does not block. Now the main question is how do we end up in that state.
Updated by pespin about 4 years ago
- Status changed from New to Feedback
- Assignee set to pespin
- % Done changed from 0 to 90
Fixed by:
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17029 stream: Drop data during write() while in state NONE
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17030 stream: Re-arrange cli states to fix 100% cpu usage bu
Extra:
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17031 stream: Add new WAIT_RECONNECT cli state
remote: https://gerrit.osmocom.org/c/libosmo-netif/+/17032 stream: Rename cli state NONE to CLOSED
Updated by fixeria about 4 years ago
- Status changed from Feedback to Resolved
- Assignee deleted (
pespin) - % Done changed from 90 to 100
The patches have been merged, both OsmoMSC and OsmoBSC now work like a charm (in both client and server modes).
Thank you, Pau!