Feature #5751

open

io_uring support in libosmocore

Added by laforge over 1 year ago. Updated 10 days ago.

Status: In Progress
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 11/09/2022
Due date:
% Done: 70%
Spec Reference:
Tags:

Description

Traditionally our I/O abstraction in libosmocore has been select(). In libosmocore 1.5.0 (2020) we migrated over to poll() to support more than 1024 FDs and to avoid the extreme amount of fd-set memcpy()ing involved in the venerable select interface.

Now of course both select and poll are ancient unix interfaces for non-blocking I/O, and both come at a high cost for systems under high load.

Specifically, we are getting reports from osmo-bsc users indicating that a busy BSC with 100 BTS (400 TRX) is spending about 40% of its CPU cycles in the kernel-side sock_poll, tcp_poll and do_sys_poll.

There are other interfaces such as linux aio, posix aio and epoll, but the brightest and shiniest new I/O interface on Linux is io_uring. Contrary to any of its predecessors, io_uring can, in the "worst" case, operate without any system calls at all anymore. io_uring recognizes that each syscall is associated with a rather high context switch cost.

io_uring consists of memory-mapped (between kernel and userspace process) queues for requests and completions, as well as lockless primitives to enqueue/dequeue from these.

The requests in the queue are requests like read N bytes from this file descriptor or write N bytes to that file descriptor. But io_uring can do much more (many other syscalls), though the read/write is the most relevant part to us.

We already have two io_uring users in the osmocom universe: the GTP and the UDP/RTP load generators I wrote some time ago. They manage their file descriptors internally.

This ticket is now about introducing io_uring support into libosmocore itself, in a way to enable all osmocom programs to use that shared infrastructure.

Conceptual differences

reading from a socket

Conceptually, the existing code typically works like this:

  1. register some socket file descriptor for read
  2. libosmocore includes it in the poll-set
  3. libosmocore calls poll()
  4. kernel returns from poll, indicating fd is readable
  5. libosmocore dispatches to the application call-back
  6. application allocates msgb, reads data from socket
  7. application processes data in msgb

With io_uring, this model needs to change to something like this:

  1. application tells us it wants to read from a socket
  2. libosmocore or application pre-allocate the msgb
  3. libosmocore uses liburing to add a read request to the io_uring submission queue
  4. kernel signals us at some point a completion event via io_uring / liburing
  5. libosmocore dispatches pre-filled msgb to application call-back
  6. application processes data in msgb

So as we can see, the responsibility for the actual reading transfers from application (or intermediate library like libosmo-netif / libosmo-sigtran) into library.
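
For reference, the existing readiness-based flow can be sketched with plain POSIX calls. This is only an illustration, not libosmocore code: all `demo_` names are hypothetical, a stack buffer stands in for the msgb, and a pipe stands in for the socket. It shows that every readable event costs one poll() plus one read() syscall, with the read performed by the application call-back.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Application call-back, invoked once the fd is readable */
typedef int (*demo_fd_cb_t)(int fd, void *priv);

/* Library side: one poll() round for a single fd, then dispatch.
 * Returns the call-back's return value, or -1 on timeout/error. */
int demo_poll_dispatch(int fd, demo_fd_cb_t cb, void *priv, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	assert(cb != NULL);
	if (poll(&pfd, 1, timeout_ms) != 1)	/* syscall 1: wait for readiness */
		return -1;
	if (!(pfd.revents & POLLIN))
		return -1;
	return cb(fd, priv);
}

/* Application side: the call-back performs the actual read itself */
static int demo_app_read_cb(int fd, void *priv)
{
	char buf[256];	/* stands in for a freshly allocated msgb */
	ssize_t len = read(fd, buf, sizeof(buf));	/* syscall 2: fetch the data */

	if (len > 0)
		*(size_t *)priv += (size_t)len;	/* "process" the data */
	return (int)len;
}

/* Demo: send 5 bytes through a pipe, return how many the call-back consumed */
size_t demo_classic_read(void)
{
	int fds[2];
	size_t total = 0;

	if (pipe(fds) < 0)
		return 0;
	if (write(fds[1], "hello", 5) == 5)
		demo_poll_dispatch(fds[0], demo_app_read_cb, &total, 1000);
	close(fds[0]);
	close(fds[1]);
	return total;
}
```

In the io_uring model the read() in the call-back disappears: the call-back would receive an already filled buffer instead of a readable fd.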

writing to a socket

Conceptually, the existing code typically works like this:

  1. register some socket file descriptor for write
  2. libosmocore includes it in the poll-set
  3. libosmocore calls poll()
  4. kernel returns from poll, indicating fd is writeable
  5. libosmocore dispatches to the application call-back
  6. application writes the data from the msgb to the socket and frees the msgb

With io_uring, this model needs to change to something like this:

  1. application tells us it wants to write to a socket, including the msgb
  2. libosmocore uses liburing to add a write request to the io_uring submission queue
  3. kernel signals us at some point a completion event via io_uring / liburing
  4. libosmocore releases the msgb with msgb_free()

Again, the actual reading/writing moves into the library and out of the scope of the application (or an intermediate library like libosmo-netif / libosmo-sigtran).


Checklist

  • sctp support in osmo_io
  • port vty over to osmo_io
  • port ctrl over to osmo_io

Related issues

  • Related to libosmo-sccp + libosmo-sigtran - Feature #5752: io_uring support in libosmo-sigtran (Resolved, jolly, 11/09/2022)
  • Related to libosmo-netif - Feature #5753: io_uring support in libosmo-netif (Resolved, Hoernchen, 11/09/2022)
  • Related to OsmoMGW - Feature #5754: io_uring support in libosmo-mgcp-client (New, jolly, 11/09/2022)
  • Related to OsmoBSC - Feature #5755: io_uring support in osmo-bsc (New, jolly, 11/09/2022)
  • Related to libosmo-abis - Bug #5756: io_uring support in libosmo-abis (New, daniel, 11/09/2022)
  • Related to libosmo-abis - Feature #5766: use Linux kernel KCM for IPA header? (New, 11/13/2022)
  • Related to Cellular Network Infrastructure - Bug #5948: Fix socket (-write) functions in multiple projects (by moving them to a common library...) (New, 03/19/2023)
  • Related to Core testing infrastructure - Feature #6357: run (some?) tests with io_uring backend for osmo_io (New, osmith, 02/09/2024)

Actions #1

Updated by laforge over 1 year ago

I like the idea of splitting this into two separate sub-tasks:

  1. introduce the conceptual API changes of having the actual read/write done inside libosmocore; then start to port applications over to that new API
  2. subsequently (and fully optionally) introduce an io_uring backend to libosmocore so it can benefit from the related performance improvements.

By splitting this up into two parts, we can more easily pinpoint any related problems, as we can test one part without the other.

Furthermore, on any older systems that don't have kernels with io_uring support, we can simply not use it, as the second step is independent of the first step. The applications always use the same API; whether or not libosmocore uses io_uring becomes an implementation detail unknown to the applications.

Actions #2

Updated by laforge over 1 year ago

  • Related to Feature #5752: io_uring support in libosmo-sigtran added
Actions #3

Updated by laforge over 1 year ago

  • Related to Feature #5753: io_uring support in libosmo-netif added
Actions #4

Updated by laforge over 1 year ago

  • Tags set to io_uring
Actions #5

Updated by laforge over 1 year ago

  • Related to Feature #5754: io_uring support in libosmo-mgcp-client added
Actions #6

Updated by laforge over 1 year ago

Actions #7

Updated by laforge over 1 year ago

  • Related to Bug #5756: io_uring support in libosmo-abis added
Actions #8

Updated by laforge over 1 year ago

For an existing example of how to use io_uring in the osmocom context, check out rtp-load-gen at https://gitea.osmocom.org/cellular-infrastructure/osmo-mgw/src/branch/laforge/rtp-load-gen/contrib/rtp-load-gen and grep for io_uring_ to see the various API calls. There's also https://gitea.osmocom.org/cellular-infrastructure/gtp-load-gen

  • io_uring_get_sqe returns an unused submission queue entry
  • io_uring_prep_read and io_uring_prep_write fill that submission queue entry with an fd, a pointer to the data and a length
  • io_uring_submit submits all prepared submission queue entries
also see:

The libosmocore integration with the existing select/poll would likely be done via an eventfd. So applications will continue to use osmo_select_main() etc. and can use any number of their file descriptors as they did so far. But libosmocore will internally register an eventfd with the existing select/poll API, so that any time io_uring wants to notify us about completions, it marks that eventfd as readable, triggering our select/poll loop to handle those completion events. So why is this faster? Because there will be one such eventfd-poll-trigger for a virtually unlimited number of io_uring completion events, as opposed to one poll+read/write syscall for each of them.
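
The coalescing effect of the eventfd can be sketched without io_uring at all. In the following illustration (hypothetical `demo_` name, not libosmocore code) the notifications are simulated by writing to the eventfd from userspace; in the real integration the kernel would signal the eventfd registered for the ring. The point is that any number of notifications is consumed by a single poll() wake-up plus a single read().

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Simulate n completion notifications on an eventfd and collect them with
 * one poll() + one read(). Returns the counter value read, or 0 on error. */
uint64_t demo_eventfd_coalescing(unsigned int n)
{
	int efd = eventfd(0, EFD_NONBLOCK);
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t one = 1, count = 0;
	unsigned int i;

	assert(n > 0);	/* with n == 0 the poll() below would only time out */
	if (efd < 0)
		return 0;
	/* each notification adds 1 to the eventfd's 64-bit counter */
	for (i = 0; i < n; i++) {
		if (write(efd, &one, sizeof(one)) != sizeof(one)) {
			close(efd);
			return 0;
		}
	}
	/* a single wake-up of the poll loop covers all n notifications */
	if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN)) {
		/* one read() returns the accumulated counter and resets it */
		if (read(efd, &count, sizeof(count)) != sizeof(count))
			count = 0;
	}
	close(efd);
	return count;
}
```

With a real ring, the handler would not rely on the counter value but would drain the completion queue until it is empty after each wake-up.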

Actions #9

Updated by Hoernchen over 1 year ago

Please keep in mind that IORING_REGISTER_IOWQ_AFF is a fairly recent feature, so unless that exists, "automatically" turning on uring support, if available, leads to a bunch of threads (as for the number and other details: https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/ is worth a read) that just end up somewhere, without easy ways to move those to a specific cpu.

Actions #10

Updated by laforge over 1 year ago

On Wed, Nov 09, 2022 at 01:58:54PM +0000, Hoernchen wrote:

Please keep in mind that IORING_REGISTER_IOWQ_AFF is a fairly recent feature, so unless that exists, "automatically" turning on uring support, if available, leads to a bunch of threads (as for the number and other details: https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/ is worth a read) that just end up somewhere, without easy ways to move those to a specific cpu.

AFAICT there are no kernel threads created for socket read/write, as sockets support non-blocking operation.

99.9% of all I/O we are doing is on sockets (UDP, TCP, SCTP, Unix) for talking to other network elements or the user via VTY/CTRL. There is a bit of file I/O when reading config files (not worth optimizing anyway) and from osmo-hlr / osmo-msc for the respective database, which is accessed in blocking I/O anyway.

Actions #11

Updated by laforge over 1 year ago

  • Related to Feature #5766: use Linux kernel KCM for IPA header? added
Actions #12

Updated by laforge over 1 year ago

  • Assignee changed from laforge to daniel

Update: I've been playing for a few days with some of the concepts and trying to bring all our requirements in-line toward the first step (new API that can support poll and later io_uring backend).

I've handed this over to daniel now as he has more time available right now and indicated an interest in this topic. We just had a call where I explained my thoughts and the latest results on how I think it should all be put together.

I'm of course available whenever feedback/questions arise.

Actions #13

Updated by laforge over 1 year ago

summary of some of my ideas / thoughts on the new I/O provider so far:

  • modes. The new I/O provider will need to offer the following modes:
    • read/write (e.g. tcp sockets for IPA OML/RSL/GSUP as well as CBSP, VTY, CTRL, ...)
    • recvfrom/sendto (e.g. UDP sockets used for RTP, GTP, MGCP, ...)
      • io_uring doesn't directly support those syscalls. However, it does support recvmsg/sendmsg, which is a superset of recvfrom/sendto combined with readv/writev
      • we have to convert recvfrom/sendto by API users (applications) to recvmsg/sendmsg
    • sctp_recvfrom/sctp_sendto (SCTP sockets for anything M3UA/SUA/sigtran)
      • this API from libsctp is just a 20-line wrapper around normal recvmsg/sendmsg calls
      • we have to re-implement this wrapper in our io_uring code
  • introduction of a new struct osmo_io_fd which will be used instead of osmo_fd, containing
    • fd
    • const char *name for application to provide a human-readable name of the FD (in case I/O provider wants to log something)
    • parameters for msgb_alloc (headroom, context, size)
    • a built-in write-queue with semantics like osmo_wqueue
    • call-back functions for the user application (read/write completion call-backs)
    • priv/priv_nr for context of application (like osmo_fd)
  • write operation
    • application does something like osmo_io_write(struct osmo_io_fd *, struct msgb *)
    • I/O provider enqueues any write into write queue and marks FD as "wants to write"
    • io_uring backend
      • would check if write is pending completion. If not, submit first entry of write_queue to io_uring
      • at some later point, I/O provider io_uring backend is notified via osmo_fd-wrapped-eventfd that io_uring has completed something
      • once I/O provider io_uring backend identifies a write has completed, it will call the io_fd->write_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) call-back
    • classic poll backend
      • would now check if OSMO_FD_WRITE is active. If not, set it.
      • gets notified that osmo_fd is writable
      • issues normal non-blocking write() syscall
      • call the io_fd->write_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) call-back
    • the application can now act based on rc (short write, negative error, dead socket, etc.)
    • once call-back returns, I/O provider does msgb_free(msg)
  • read operation
    • application notifies I/O provider that it wants to read from osmo_io_fd
    • io_uring backend
      • allocates a msgb (using parameters provided by application stored in osmo_io_fd)
      • submits a read() syscall to io_uring submission queue pointing to msgb memory
      • completion is handled just like the write completion via osmo_fd-wrapped-eventfd
      • io_fd->read_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) is called
    • classic poll backend
      • enables OSMO_FD_READ on socket
      • gets notified that osmo_fd is readable once data is available
      • allocates a msgb (using parameters provided by application stored in osmo_io_fd)
      • issues normal non-blocking read() syscall
      • io_fd->read_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) is called
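
The write path of the classic poll backend outlined above can be sketched roughly as follows. This is a heavily reduced illustration under stated assumptions, not the proposed API: every `demo_` name is hypothetical, a small heap buffer stands in for struct msgb, and the poll loop is reduced to a single fd.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define DEMO_MSG_MAX 64

struct demo_msg {	/* stand-in for a msgb */
	struct demo_msg *next;
	size_t len;
	unsigned char data[DEMO_MSG_MAX];
};

struct demo_io_fd;
typedef void (*demo_write_cb_t)(struct demo_io_fd *iofd, int rc, struct demo_msg *msg);

struct demo_io_fd {	/* reduced counterpart of the proposed osmo_io_fd */
	int fd;
	const char *name;
	struct demo_msg *txq_head, *txq_tail;	/* built-in write queue */
	demo_write_cb_t write_cb;
	void *priv;
};

/* Application side: only enqueue; the provider performs the write later */
int demo_io_write(struct demo_io_fd *iofd, const void *data, size_t len)
{
	struct demo_msg *msg;

	assert(iofd && data);
	if (len > DEMO_MSG_MAX || !(msg = calloc(1, sizeof(*msg))))
		return -1;
	memcpy(msg->data, data, len);
	msg->len = len;
	if (iofd->txq_tail)
		iofd->txq_tail->next = msg;
	else
		iofd->txq_head = msg;
	iofd->txq_tail = msg;
	return 0;
}

/* Poll backend: wait for POLLOUT, write the first queue entry, call the
 * completion call-back, free the message. Returns 1 if a message was
 * handled, 0 if the queue was empty, -1 if the fd did not become writable. */
int demo_io_poll_once(struct demo_io_fd *iofd, int timeout_ms)
{
	struct pollfd pfd = { .fd = iofd->fd, .events = POLLOUT };
	struct demo_msg *msg = iofd->txq_head;
	int rc;

	if (!msg)
		return 0;
	if (poll(&pfd, 1, timeout_ms) != 1 || !(pfd.revents & POLLOUT))
		return -1;
	iofd->txq_head = msg->next;
	if (!iofd->txq_head)
		iofd->txq_tail = NULL;
	rc = (int)write(iofd->fd, msg->data, msg->len);
	if (iofd->write_cb)
		iofd->write_cb(iofd, rc, msg);	/* application may inspect rc */
	free(msg);	/* provider frees once the call-back has returned */
	return 1;
}

static void demo_count_written(struct demo_io_fd *iofd, int rc, struct demo_msg *msg)
{
	(void)msg;
	if (rc > 0)
		*(int *)iofd->priv += rc;
}

/* Demo: queue two messages on a socketpair, flush, return bytes written */
int demo_write_queue(void)
{
	int sv[2], written = 0;
	struct demo_io_fd iofd = { .name = "demo", .write_cb = demo_count_written, .priv = &written };

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	iofd.fd = sv[0];
	demo_io_write(&iofd, "abc", 3);
	demo_io_write(&iofd, "defgh", 5);
	while (demo_io_poll_once(&iofd, 1000) == 1)
		;
	close(sv[0]);
	close(sv[1]);
	return written;
}
```

An io_uring backend would keep demo_io_write() and the call-back unchanged and only replace the poll()/write() pair by a submission and a later completion, which is the point of hiding the backend behind one API.
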
Actions #14

Updated by laforge over 1 year ago

For the {send,recv}{to,from,msg}() family of calls, we need to extend the above slightly. In addition to the raw msgb, we have metadata like the struct sockaddr to send to.

I originally thought we could push this to the front of the msgb headroom, but sockaddr_storage is already 128 bytes, and together with the struct msghdr, struct iovec etc. this quickly adds up to something like 200 bytes. Since the msgb size (including headroom) is limited to 16 bits (a historical mistake), I'm not sure if it's the right way.

I then decided to go for a struct serialized_msghdr which we allocate at the time the user issues e.g. an osmo_io_sendto(struct osmo_io_fd *, struct msgb *msg, int flags, const struct sockaddr *dest_addr, socklen_t addrlen) call. The function would then copy the provided parameters into that heap-allocated serialized_msghdr, and enqueue that (instead of the pure msg) into the in-memory transmit queue. Once the actual sendmsg call is performed (async via io_uring or directly via syscall), we dequeue that msghdr and make use of it. On completion we call the user completion call-back and then free the serialized_msghdr as well as the msgb afterwards.

The same approach also works for the recvmsg/recvfrom case, where we can have an application call-back like void (*recvfrom_cb)(struct osmo_io_fd *iofd, int rc, struct msgb *msg, struct sockaddr *src_addr, socklen_t *addrlen);

Equally this approach works for sctp_sendmsg/sctp_recvmsg as those are just wrappers with different function arguments that all get encoded into a struct msghdr.
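
A minimal sketch of the serialized msghdr idea, assuming a layout that is purely illustrative (the struct and function names with the `demo_` prefix are hypothetical, and a plain byte buffer replaces the msgb). It copies sendto()-style arguments into one heap object and later performs the operation with a normal sendmsg() over UDP on loopback; an io_uring backend would submit the same struct msghdr instead.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Everything sendmsg() needs, kept alive on the heap until completion */
struct demo_serialized_msghdr {
	struct msghdr hdr;
	struct sockaddr_storage addr;
	struct iovec iov[1];
	int flags;
};

/* Copy sendto()-style arguments into a self-contained msghdr. The payload
 * is referenced, not copied; it must outlive the request (like a msgb). */
struct demo_serialized_msghdr *demo_msghdr_from_sendto(void *payload, size_t len, int flags,
						       const struct sockaddr *dest, socklen_t addrlen)
{
	struct demo_serialized_msghdr *smh;

	assert(dest != NULL);
	if (addrlen > sizeof(smh->addr))
		return NULL;
	smh = calloc(1, sizeof(*smh));
	if (!smh)
		return NULL;
	memcpy(&smh->addr, dest, addrlen);
	smh->iov[0].iov_base = payload;
	smh->iov[0].iov_len = len;
	smh->hdr.msg_name = &smh->addr;
	smh->hdr.msg_namelen = addrlen;
	smh->hdr.msg_iov = smh->iov;
	smh->hdr.msg_iovlen = 1;
	smh->flags = flags;
	return smh;
}

/* Demo: send 4 bytes to a UDP socket on loopback via the serialized header.
 * Returns the number of bytes received, or -1 on error. */
int demo_sendto_via_msghdr(void)
{
	struct sockaddr_in sin = { .sin_family = AF_INET };
	socklen_t slen = sizeof(sin);
	char payload[] = "ping", rxbuf[16];
	struct demo_serialized_msghdr *smh;
	int rx, tx, rc = -1;

	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;
	/* bind to port 0 so the kernel picks a free port, then look it up */
	if (bind(rx, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&sin, &slen) < 0)
		goto out;
	smh = demo_msghdr_from_sendto(payload, 4, 0, (struct sockaddr *)&sin, sizeof(sin));
	if (!smh)
		goto out;
	/* "dequeue" and perform the operation synchronously */
	if (sendmsg(tx, &smh->hdr, smh->flags) == 4)
		rc = (int)recv(rx, rxbuf, sizeof(rxbuf), 0);
	free(smh);	/* freed only after the operation has completed */
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return rc;
}
```

Keeping the address and iovec inside the same allocation as the msghdr means the internal pointers stay valid for as long as the request is queued or in flight.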

Actions #15

Updated by laforge over 1 year ago

  • Priority changed from Normal to Urgent
Actions #16

Updated by daniel over 1 year ago

  • Status changed from New to In Progress
Actions #17

Updated by daniel about 1 year ago

  • % Done changed from 0 to 30

An update on the io_uring osmo_io progress so far:
The WIP commits are in libosmocore.git branch daniel/io_uring.
https://gitea.osmocom.org/osmocom/libosmocore/src/branch/daniel/io_uring

What's done

I managed to get a basic version of osmo_io working with the poll backend and also have a working backend for io_uring.

With it, the NS2 UDP socket uses osmo_io with sendto()/recvfrom(). The control interface also uses osmo_io, complete with IPA parsing/segmentation (in read()/write() mode).

With this, the ttcn3 osmo-gbproxy tests (which also use the ctrl_if) as well as make distcheck pass.

libosmocore currently tries to build with uring support unless passed `--disable-uring` during configure. The default will be io_uring if it's enabled.
The environment variable `LIBOSMO_IO_BACKEND` can be used to switch backends at runtime. Setting it to something other than "IO_URING" will use the poll/osmo_fd backend. This can be verified by setting the new DIO loglevel to DEBUG and watching for the message:
"iofd(<name>) using backend poll/uring"

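The selection rule described in this update could look roughly like the sketch below. It only mirrors the behaviour stated here (io_uring by default when built in, any other value of `LIBOSMO_IO_BACKEND` selects poll); the `demo_` names are hypothetical, and falling back to poll in a `--disable-uring` build is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum demo_io_backend { DEMO_BACKEND_POLL, DEMO_BACKEND_IO_URING };

/* env_value: content of LIBOSMO_IO_BACKEND, or NULL if unset.
 * uring_built_in: non-zero if the library was built with io_uring support. */
enum demo_io_backend demo_select_backend(const char *env_value, int uring_built_in)
{
	if (!uring_built_in)
		return DEMO_BACKEND_POLL;	/* assumption: nothing else available */
	if (!env_value)
		return DEMO_BACKEND_IO_URING;	/* default when io_uring is enabled */
	if (strcmp(env_value, "IO_URING") == 0)
		return DEMO_BACKEND_IO_URING;
	return DEMO_BACKEND_POLL;		/* any other value selects poll */
}

/* Name as it would appear in a debug message about the chosen backend */
const char *demo_backend_name(enum demo_io_backend backend)
{
	switch (backend) {
	case DEMO_BACKEND_POLL:
		return "poll";
	case DEMO_BACKEND_IO_URING:
		return "uring";
	}
	assert(!"unknown backend");
	return "?";
}

/* Convenience wrapper reading the environment at runtime */
enum demo_io_backend demo_backend_from_env(int uring_built_in)
{
	return demo_select_backend(getenv("LIBOSMO_IO_BACKEND"), uring_built_in);
}
```

Separating the decision from the getenv() call keeps the rule easy to unit-test without touching the process environment.
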
Open issues

  • Porting over the ipa.c/ipaccess.c code in libosmo-abis will be a significant amount of work since quite a few functions get direct access to an osmo_fd or even a plain fd. They then write()/send() directly to those, which will need to move to a tx queue-aware model.
  • libosmo-netif has some similar issues in its ipa code, but in general looks much better because the osmo_stream api already uses a tx_queue internally and matches the callback api of osmo_io much better.
  • sctp support is not implemented in osmo_io yet. This will be a wrapper around send/recvmsg, so shouldn't be too complicated.

API notes

The osmo_io api currently has a _setup function that takes and registers a plain fd and returns a newly allocated struct osmo_io_fd *. This worked ok for ctrl_if and gprs_ns2, but I noticed in a couple places in libosmo-abis that the osmo_fd struct (with callbacks, data, ...) is initialized in one part of the code with the fd set to -1 and only registered in another when the fd is actually present.

Right now the osmo_io assumes that the fd is configured/connected/... correctly and will not do anything there except try to read/write from it. This should be ok for now and you can always get the raw fd and do some get/setsockopts on there.

Actions #18

Updated by laforge 11 months ago

  • Related to Bug #5948: Fix socket (-write) functions in multiple projects (by moving them to a common library...) added
Actions #19

Updated by laforge 11 months ago

  • Priority changed from Urgent to Immediate

This ticket has been in need of updates for months. The branch has not seen any commits since early December. Yet from spoken status reports I know there has been more recent activity.

Please make sure to update the relevant tickets and keep pushing the current branches, thanks.

Actions #20

Updated by pespin 10 months ago

In order to push this forward a bit while daniel is not available, I took over his patches (libosmocore.git daniel/io_uring), rebased and fixed/improved several things, and submitted them to gerrit. They are now available in branch "pespin/io_uring".

Actions #21

Updated by osmith 6 months ago

  • % Done changed from 30 to 40

The patch has been merged to master:
https://gerrit.osmocom.org/c/libosmocore/+/32536

I've adjusted infrastructure to fix failing builds related to the new liburing dependency:
https://gerrit.osmocom.org/q/topic:osmo-io+author:osmith%2540sysmocom.de+-is:merged

Actions #22

Updated by laforge 3 months ago

  • Checklist item sctp support in osmo_io added
  • Assignee changed from daniel to laforge

regarding the high-level aspects of SCTP support, see some updates in #5752#note-9

One of the unexpected problems is that msgb_sctp_{ppid,stream} is defined in libosmo-netif and hence is not available in libosmocore. We hence cannot use those existing definitions to pass parameters around in msgb :/ - and as usual, moving stuff between libraries is hard as it might break users and lead to duplicate definitions, etc.

Actions #23

Updated by laforge about 2 months ago

  • Assignee changed from laforge to Hoernchen

My latest WIP code is in the laforge/osmo_io_sctp branch of libosmocore.

As far as I recall, one of the last problems I saw was that the well-known mechanism for non-blocking connect (mark FD as want-to-write, then call connect(), poll/select will mark socket as write-able) did not work with SCTP and io_uring. It did work for TCP sockets, but not for SCTP.

So as a result, we likely need some kind of work-around where we first bypass io_uring and simply use the normal select/poll code until the socket is connected, and only then start to use io_uring after the connect phase is completed.
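
A rough sketch of such a two-phase approach: the connect phase is handled with plain non-blocking connect() and poll(), and only the connected fd would be handed to the asynchronous backend. The `demo_` names are hypothetical, and the demo uses TCP on loopback purely because it is available everywhere; it does not reproduce the SCTP problem itself.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a non-blocking connect and wait for its completion with poll().
 * Returns the connected fd (ready to be handed over), or -1 on failure. */
int demo_connect_then_handover(const struct sockaddr *dest, socklen_t len, int timeout_ms)
{
	struct pollfd pfd = { .events = POLLOUT };
	socklen_t errlen;
	int fd, err = 0;

	assert(dest != NULL);
	fd = socket(dest->sa_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
	if (fd < 0)
		return -1;
	pfd.fd = fd;
	if (connect(fd, dest, len) < 0) {
		if (errno != EINPROGRESS)
			goto fail;
		/* connect in progress: writability signals that it finished */
		if (poll(&pfd, 1, timeout_ms) != 1)
			goto fail;
		/* SO_ERROR tells whether it finished successfully */
		errlen = sizeof(err);
		if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
			goto fail;
	}
	return fd;	/* only from here on would io_uring-based I/O be used */
fail:
	close(fd);
	return -1;
}

/* Demo against a TCP listener on loopback; returns 0 on success */
int demo_nonblocking_connect(void)
{
	struct sockaddr_in sin = { .sin_family = AF_INET };
	socklen_t slen = sizeof(sin);
	int lfd, cfd, rc = -1;

	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;
	/* port 0: let the kernel choose, then query the actual port */
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == 0 &&
	    listen(lfd, 1) == 0 &&
	    getsockname(lfd, (struct sockaddr *)&sin, &slen) == 0) {
		cfd = demo_connect_then_handover((struct sockaddr *)&sin, sizeof(sin), 1000);
		if (cfd >= 0) {
			rc = 0;
			close(cfd);
		}
	}
	close(lfd);
	return rc;
}
```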

Actions #24

Updated by laforge about 2 months ago

  • Checklist item port vty over to osmo_io added
  • Checklist item port ctrl over to osmo_io added
Actions #25

Updated by laforge 20 days ago

  • Assignee changed from Hoernchen to jolly
Actions #26

Updated by laforge 16 days ago

  • Related to Feature #6357: run (some?) tests with io_uring backend for osmo_io added
Actions #27

Updated by laforge 16 days ago

stats_reporter

I briefly looked at migrating the osmo_stats_reporter over to osmo_io, and I'm not entirely sure if it's that great an idea. Right now each stats_reporter has one msgb that's allocated at socket-open time. Whenever there's something to write, that buffer is used and then immediately sent off using sendto(). There's no integration into the osmocom select loop. We always assume the [udp] socket is writable. The buffers are hence never free'd or re-allocated at runtime.

If we switch over to osmo_io, then it would mean every stats report allocates a new msgb, and once that's handed over to the io_uring backend there are even more allocations. So yes, we'd save the sendto system call, but at the cost of more load on the heap allocator. The syscall is likely more expensive, sure. But is it worth it? I'm not so sure.

ctrl

That should be fairly trivial to migrate over. It uses a tx_queue anyway for writes today, so handing those msgb's over to osmo_io should be easy.

vty

The VTY uses its 'buffer' layer between writes by the software and writing to the actual socket file descriptor. buffer_flush_all is currently used whenever the socket is write-able. So it's a pull model.

We'd probably have to change that logic to work the other way around: Once a buffer has a certain fill-level (or age?), proactively push it via osmo_io.
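
One way such a push model could look, as a sketch only: output is accumulated and handed over as soon as a fill level is reached. All names and the threshold value are hypothetical, and a call-back stands in for passing the data to osmo_io; this is not how the VTY buffer code works today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE	256
#define DEMO_PUSH_LEVEL	64	/* hypothetical fill-level threshold */

typedef void (*demo_push_fn)(const char *data, size_t len, void *priv);

struct demo_out_buf {
	char data[DEMO_BUF_SIZE];
	size_t used;
	demo_push_fn push;	/* stand-in for handing the data to osmo_io */
	void *priv;
};

/* Hand over whatever is buffered (could also be called from an age timer) */
void demo_buf_flush(struct demo_out_buf *b)
{
	assert(b->push != NULL);
	if (b->used == 0)
		return;
	b->push(b->data, b->used, b->priv);
	b->used = 0;
}

/* Append output and push proactively once the fill level is reached.
 * Returns 0 on success, -1 if the chunk can never fit into the buffer. */
int demo_buf_append(struct demo_out_buf *b, const char *data, size_t len)
{
	if (len > DEMO_BUF_SIZE)
		return -1;
	if (b->used + len > DEMO_BUF_SIZE)
		demo_buf_flush(b);	/* make room first */
	memcpy(b->data + b->used, data, len);
	b->used += len;
	if (b->used >= DEMO_PUSH_LEVEL)
		demo_buf_flush(b);
	return 0;
}

static void demo_count_push(const char *data, size_t len, void *priv)
{
	(void)data;
	(void)len;
	(*(int *)priv)++;
}

/* Demo: append `chunks` chunks of 16 bytes, return the number of pushes */
int demo_count_pushes(int chunks)
{
	int pushes = 0, i;
	struct demo_out_buf b = { .push = demo_count_push, .priv = &pushes };

	for (i = 0; i < chunks; i++)
		demo_buf_append(&b, "0123456789abcdef", 16);
	return pushes;
}
```

A pure fill-level rule leaves short output sitting in the buffer, which is why an additional age-based flush (as hinted at above) would be needed for interactive use.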

Unless somebody uses a lot of scripts accessing the VTY, it's also unlikely that the syscall load of a human VTY user would place significant I/O load on an osmocom program. So it's not super critical.

Actions #28

Updated by jolly 10 days ago

  • % Done changed from 40 to 70

Patches are in Gerrit as 'WIP' with topic osmo_io_uring. Same with libosmo-netif.
