Project

General

Profile

Actions

Bug #6279

closed

stream_cli fails to connect when using SCTP if no local address is set

Added by laforge 3 months ago. Updated 17 days ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
11/29/2023
Due date:
% Done:

90%

Spec Reference:

Description

In the context of developing osmo_io/io_uring SCTP support, I was trying to build a simple test case, and remembered the stream-server/stream-client programs of libosmo-netif.

Those tools are made for TCP, but one can of course add simple calls to osmo_stream_*_set_proto(foo, IPPROTO_SCTP). For the server that appears to work, but for the client it doesn't: osmo_stream_cli_open returns -22/EINVAL.

I've tracked this down to the difference in setting up the socket:osmo_sock_init2_multiaddr2 (SCTP) vs osmo_sock_init2 (TCP).

It seems that the multiaddr variant fails with -EINVAL if the BIND flag is set and no single local address is given, while the regular variant osmo_sock_init2 succeeds. Is this an intentional difference?

I would expect the BIND flag to behave like usual if no local address is provided: Bind to INADDR_ANY/IN6ADDR_ANY.

Slightly unrlated: One could also argue why does libosmo-netif stream client set the BIND flag if it doesn't really want to bind to a non-specific address? Well, it could want to bind to a specific local port.


Related issues

Related to libosmo-netif - Feature #5753: io_uring support in libosmo-netifResolvedHoernchen11/09/2022

Actions
Related to OsmoSTP - Bug #5186: local-ip config parameter can cause unclear behaviour on dual IP stack systemResolvedpespin06/18/2021

Actions
Actions #1

Updated by laforge 3 months ago

  • Related to Feature #5753: io_uring support in libosmo-netif added
Actions #2

Updated by laforge 3 months ago

  • Status changed from New to Feedback
  • Assignee changed from sur5r to pespin
Actions #3

Updated by pespin 3 months ago

  • Related to Bug #5186: local-ip config parameter can cause unclear behaviour on dual IP stack system added
Actions #4

Updated by pespin 3 months ago

The osmo_sock_init2_multiaddr2() is properly returning an error if OSMO_SOCK_F_BIND is being passed to it and local_hosts_cnt=0 is provided.

Now, the problem comes when one wants to use the "bind-to-NULL" address behavior. This has so far been unsupported in osmo_sock_init2_multiaddr2(). In order to run that scenario with osmo_sock_init2_multiaddr2 API one should pass: local_hosts_cnt=1, local_hosts0=NULL.

The main (and mostly solely) user of osmo_sock_init2_multiaddr2() is osmo_stream API. Specifically at osmo_stream_cli, we always relied on the user of the osmo_stream API (eg. libosmo-sccp) to set the local address to something meaningful (non-NULL). See for instance osmo_ss7_vty_go_parent():

       case L_CS7_ASP_NODE:
            asp = vty->index;
            /* Make sure proper defaults values are set */
            ss7_asp_set_default_peer_hosts(asp);
            osmo_ss7_asp_restart(asp);
            vty->node = L_CS7_NODE;
            vty->index = asp->inst;
            break;
....
       case L_CS7_XUA_NODE:
            oxs = vty->index;
            /* If no local addr was set, or erased after _create(): */
            ss7_xua_server_set_default_local_hosts(oxs);
            if (osmo_ss7_xua_server_bind(oxs) < 0)
                  vty_out(vty, "%% Unable to bind xUA server to IP(s)%s", VTY_NEWLINE);
            vty->node = L_CS7_NODE;
            vty->index = oxs->inst;
            break;

From above, pay attention to ss7_xua_server_set_default_local_hosts():

298:bool ss7_xua_server_set_default_local_hosts(struct osmo_xua_server *oxs)
299-{
300-    /* If no local addr was set, or erased after _create(): */
301-    if (!oxs->cfg.local.host_cnt) {
302-            /* "::" Covers both IPv4 and IPv6 */
303-            if (ss7_ipv6_sctp_supported("::", true))
304-                        osmo_ss7_xua_server_set_local_host(oxs, "::");
305-            else
306-                        osmo_ss7_xua_server_set_local_host(oxs, "0.0.0.0");
307-            return true;
308-    }
309-    return false;
310-}

And ss7_asp_set_default_peer_hosts() is even more complex.

Now, when using osmo_stream directly, eg. examples/stream-client from libosmo-netif, we simply have:

    conn = osmo_stream_cli_create(tall_test);
    if (conn == NULL) {
        fprintf(stderr, "cannot create cli\n");
        exit(EXIT_FAILURE);
    }
    osmo_stream_cli_set_name(conn, "stream_client");
    osmo_stream_cli_set_addr(conn, "::1");
    osmo_stream_cli_set_port(conn, 10000);
    if (use_sctp)
        osmo_stream_cli_set_proto(conn, IPPROTO_SCTP);

    osmo_stream_cli_set_connect_cb(conn, connect_cb);
    osmo_stream_cli_set_disconnect_cb(conn, disconnect_cb);
    osmo_stream_cli_set_read_cb2(conn, read_cb);

    rc = osmo_stream_cli_open(conn);
    if (rc < 0) {
        fprintf(stderr, "cannot open cli: %d\n", rc);
        exit(EXIT_FAILURE);
    }

So when osmo_stream_cli_open() is called, who's the responisble of creating the socket through osmo_sock_init2_multiaddr2(), used to pass local_addrcnt=0 if no address was set, and creating the socket failed as expected, because so far it was assumed necessary to set some local address in the SCTP case. This is also related to #5186.

We can allow the "bind-to-null by default if no local address has been set in osmo_stream object" behavior simply with this patch:

diff --git a/src/stream_cli.c b/src/stream_cli.c
index 1e7ebeb..086d545 100644
--- a/src/stream_cli.c
+++ b/src/stream_cli.c
@@ -839,6 +839,7 @@ void osmo_stream_cli_set_nodelay(struct osmo_stream_cli *cli, bool nodelay)
 int osmo_stream_cli_open(struct osmo_stream_cli *cli)
 {
        int ret, fd = -1;
+       unsigned int local_addrcnt = cli->local_addrcnt;

        /* we are reconfiguring this socket, close existing first. */
        if ((cli->flags & OSMO_STREAM_CLI_F_RECONF) && osmo_stream_cli_get_fd(cli) >= 0)
@@ -856,8 +857,11 @@ int osmo_stream_cli_open(struct osmo_stream_cli *cli)
                switch (cli->proto) {
 #ifdef HAVE_LIBSCTP
                case IPPROTO_SCTP:
+                       /* If no local addr configured, use local_addr[0]=NULL by default when creating the socket. */
+                       if (local_addrcnt == 0)
+                               local_addrcnt = 1;
                        ret = osmo_sock_init2_multiaddr2(cli->sk_domain, cli->sk_type, cli->proto,
-                                                       (const char **)cli->local_addr, cli->local_addrcnt, cli->local_port,
+                                                       (const char **)cli->local_addr, local_addrcnt, cli->local_port,
                                                        (const char **)cli->addr, cli->addrcnt, cli->port,
                                                        OSMO_SOCK_F_CONNECT|OSMO_SOCK_F_BIND|OSMO_SOCK_F_NONBLOCK,
                                                        &cli->ma_pars);

With that patch, the "libosmo-netif/examples/stream-client -s" works when connecting using local address NULL and remote address "127.0.0.1". However, it fails when using same local address but using remote address IPv6 "::1":

0001> /home/pespin/dev/sysmocom/git/libosmocore/src/core/socket.c:887 Invalid v4 vs v6 in local vs remote addresses: local: v4 remote: v6
<0003> /home/pespin/dev/sysmocom/git/libosmo-netif/src/stream_cli.c:145 CLICONN(stream_client,){WAIT_RECONNECT} retrying reconnect in 5 seconds...
cannot open cli: -22

IIRC that happens because when resolving NULL to an address through getaddrinfo, it ends up resolving to an AF_INET address in first instance, and later on on the remote side "::1" (or even worse, "localhost") is resolved to an AF_INET6 address. So IIRC you end up with local address "0.0.0.0" and remote address "::1", which looks bad.

Actions #5

Updated by pespin 3 months ago

For NULL string, getaddrinfo returns the following entries:

(gdb) print addrinfo[i]
$4 = (struct addrinfo *) 0x606000000080
(gdb) print *addrinfo[i]
$5 = {ai_flags = 1, ai_family = 2, ai_socktype = 1, ai_protocol = 132, ai_addrlen = 16, ai_addr = 0x6060000000b0, ai_canonname = 0x0,
  ai_next = 0x607000000020}
(gdb) print *addrinfo[i]->ai_next
$6 = {ai_flags = 1, ai_family = 10, ai_socktype = 1, ai_protocol = 132, ai_addrlen = 28, ai_addr = 0x607000000050, ai_canonname = 0x0, ai_next = 0x0}

First one is IPv4 (0.0.0.0, ai_family=2=AF_INET)
Second one is IPv6 ("::" , ai_family=10=AF_INET6).

The problem is that later on, we only take the first option into account when checking for matches on what to use (IPv4 vs IPv6) based on IP address found in local and remote:

/* Check whether there's an IPv6 Addr as first option of any addrinfo item in the addrinfo set */
static void addrinfo_has_v4v6addr(const struct addrinfo **result, size_t result_count, bool *has_v4, bool *has_v6)
{
    size_t host_idx;
    *has_v4 = false;
    *has_v6 = false;

    for (host_idx = 0; host_idx < result_count; host_idx++) {
        if (result[host_idx]->ai_family == AF_INET)
            *has_v4 = true;
        else if (result[host_idx]->ai_family == AF_INET6)
            *has_v6 = true;
    }
}

So the "::" from "NULL" gets discarded, assume NULL -> "0.0.0.0" because it's the first entry returned by getaddrinfo. And this is all a bit fucked up, because on the remote side, using "localhost" returns an IPv6 if IPv6 is enabled, but not in the local side when NULL is provided....

The problem is that having to take into account the several entries of each getaddrinfo for each address in the local and remote set, is quite challenging. There's even a TODO in the code in osmo_sock_init2_multiaddr2() I added when initially implementing the function:

        /* Build array of addresses taking first entry for each host.
           TODO: Ideally we should use backtracking storing last used
           indexes and trying next combination if connect() fails .*/

And AFAIR the summary at the time was: current implementation is good enough (TM) and willing to fix the NULL case in all scenarios (multiple IPv4 and IPv6 addresses) is going into a rabbit hole.

Actions #6

Updated by laforge 3 months ago

So the first that's puzzling me is why the getaddrinfo would first return an IPv4 address, and then IPv6 only as a second result. Normally the OS default (at least on Linux, but I thought that is universal) is to prefer IPv6 over IPv4, as long as v6 is available.

It seems https://www.rfc-editor.org/rfc/rfc6724 is specifying this (at least important parts of it), and that /etc/gai.conf can set specific priorities/precedence. Assuming there are no local/distribution specific overrides for the remainder of this discussion.

https://stackoverflow.com/questions/65888168/why-does-getaddrinfo-return-first-ipv4-instead-of-ipv6-when-ai-passive-is-set
also covers that topic. It quotes the W. Richard Stevens "Unix Network Programming" as stating that "it makes sense to return the IPv6 wild card address first"

I guess we'll have to look at glibc sources and/or commit history.

On Mon, Dec 04, 2023 at 04:42:36PM +0000, pespin wrote:

The problem is that having to take into account the several entries of each getaddrinfo for each address in the local and remote set, is quite challenging. There's even a TODO in the code in osmo_sock_init2_multiaddr2() I added when initially implementing the function:

>         /* Build array of addresses taking first entry for each host.
>            TODO: Ideally we should use backtracking storing last used
>            indexes and trying next combination if connect() fails .*/
> 

That's why all of the getaddrinfo() documentation states one must always iterate over all entries... which
makes a lot of sense if you're doing actual name resolution using DNS where there can be many record. We're a bit of a strange user in that we don't do address resolution, but just string -> address conversion.

Actions #7

Updated by pespin 3 months ago

FYI, this code in libosmocore socket.c also seems to document a bit the order of getaddrinfo():

/* Build array of addresses taking first addrinfo result of the requested family
 * for each host in addrs_buf. */
static int addrinfo_to_sockaddr(uint16_t family, const struct addrinfo **result,
                const char **hosts, unsigned int host_cont,
                uint8_t *addrs_buf, size_t addrs_buf_len) {
    size_t host_idx, offset = 0;
    const struct addrinfo *rp;

    for (host_idx = 0; host_idx < host_cont; host_idx++) {
        /* Addresses are ordered based on RFC 3484, see man getaddrinfo */
        for (rp = result[host_idx]; rp != NULL; rp = rp->ai_next) {

"Addresses are ordered based on RFC 3484, see man getaddrinfo"

Actions #8

Updated by pespin 3 months ago

Ah I see now that the rfc6724 you mention makes rfc3484 obsolete.

Actions #9

Updated by pespin 3 months ago

In any case, if the order of getaddrinfo() would be the opposite, we'd still face a problem in the opposite scenario, aka where local=NULL remote=127.0.0.1 is passed. We'd end up with an AF_INET6 local address and a AF_INET remote address.
So again, we could try improving the logic trying to match both sides, but it easily becomes a complex algorithm to find a proper match.

Actions #10

Updated by laforge 3 months ago

pespin wrote in #note-9:

So again, we could try improving the logic trying to match both sides, but it easily becomes a complex algorithm to find a proper match.

Looks like there really is no way around going through the effort of finding a proper match. We're in the lucky situation that we already know both the local and the remote "hostnames" at the same time in that function, so it should be possible to find matches.

Actions #11

Updated by pespin 3 months ago

Maybe we are really good enough simply modifying osmo_stream_client_open() to avoid passing OSMO_SOCK_F_BIND if "cli->local_addrcnt 0 && cli->local_port 0", that way we skip the problem in the easy case where the user of the osmo_stream_cli API doesn't set local address not port since only cares about connecting to a specific remote addr+port.

Actions #12

Updated by laforge 3 months ago

pespin wrote in #note-11:

Maybe we are really good enough simply modifying osmo_stream_client_open() to avoid passing OSMO_SOCK_F_BIND if "cli->local_addrcnt 0 && cli->local_port 0", that way we skip the problem in the easy case where the user of the osmo_stream_cli API doesn't set local address not port since only cares about connecting to a specific remote addr+port.

Yes, that sounds like a good interim workarond for the time being.

Actions #13

Updated by pespin 3 months ago

  • % Done changed from 0 to 90

I reimplemented osmo_sock_init2_multiaddr2() and now it works fine under src=NULL situation.
https://gerrit.osmocom.org/c/libosmocore/+/35232 Reimplement osmo_sock_init2_multiaddr()

It will need an extra patch in libosmo-netif osmo_stream_cli to set loc_hosts_cnt=1 and loc_hosts0=NULL if none is set before open() time:
https://gerrit.osmocom.org/c/libosmo-netif/+/35235 stream_cli: Fix opening sctp client socket if no local address set

Actions #14

Updated by pespin 3 months ago

  • Assignee changed from pespin to laforge

Merged, reassigning to laforge in case he wants to give it a try now or if he finds more related issues.
Otherwise it can be closed.

Actions #15

Updated by laforge 17 days ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)