Feature #6387

osmo_io / io_uring support for RTP/RTCP

Added by laforge about 2 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Assignee: laforge
Category: -
Target version: -
Start date: 03/02/2024
Due date: -
% Done: 90%
Spec Reference: -
Tags: io_uring

Description

The RTP/RTCP sockets of osmo-mgw should be prime candidates for migration to osmo_io and hence benefit from the optional io_uring backend.

Given the many small recvfrom/sendto syscalls on those sockets, performance should improve significantly.


Related issues

Related to libosmocore - Feature #5751: io_uring support in libosmocore (Resolved, jolly, 11/09/2022)

Actions #1

Updated by laforge about 2 months ago

  • Tags set to io_uring
Actions #2

Updated by laforge about 2 months ago

  • Related to Feature #5751: io_uring support in libosmocore added
Actions #3

Updated by laforge about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to laforge
  • % Done changed from 0 to 80

The patch is at https://gerrit.osmocom.org/c/osmo-mgw/+/36363 - in my local testing it shows no regressions in the TTCN3 test suite. Jenkins, however, does report regressions in the unit tests; I'll investigate.

In a benchmark running 200 concurrent bi-directional voice calls (set up from mncc-python, using rtpsource as RTP generator) with GSM-EFR codec, I am observing:

  • the code before this patch uses 40..42% of a single core on a Ryzen 5950X at 200 calls (=> 200 endpoints with two connections each)
  • no increase in CPU utilization before/after this patch, i.e. the osmo_io overhead with the OSMO_FD backend is insignificant compared to the direct osmo_fd mode used before
  • an almost exactly 50% reduction in CPU utilization when running the same osmo-mgw build with LIBOSMO_IO_BACKEND=IO_URING: top shows 19..21% for the same workload instead of 40..42% with the default OSMO_FD backend
  • an increase of about 4 megabytes in both RSS and VIRT size when enabling the IO_URING backend; this is likely the memory-mapped rings
Actions #4

Updated by laforge about 1 month ago

When doing a strace on the process, we can now see that the only syscalls really are:
  • poll (including the eventfd of io_uring)
  • the read of said eventfd
  • tons of io_uring_enter() syscalls

The latter are the result of us calling io_uring_submit() after every individual read or write operation we add to the submission queue.

I've done another experiment to remove those io_uring_submit() calls and do them just before we enter poll(). This indeed removes the duplicate io_uring_enter() syscalls, and we now have an equal number of poll, read(eventfd) and io_uring_enter calls. The patch is at https://gerrit.osmocom.org/c/libosmocore/+/36364

However, this does not make a visible difference in the CPU utilization reported by top/ps - maybe 1%, but not more. So at least at this relatively low overall CPU load of ~20% it doesn't matter; this might change when we get closer to 100% CPU, where more batching could yield more benefit.

FYI, in my 200-calls on 200-endpoints with 400-connections load test, I'm seeing the eventfd signalling something like 3..5 completions each time we poll+read it.

Actions #6

Updated by laforge about 1 month ago

  • % Done changed from 80 to 90

Finally ported the failing unit test over to the new code; build verification now passes.
