Project

General

Profile

Bug #2561

osmo-mgw OBS nightly builds failing for 3 consecutive days

Added by laforge 12 days ago. Updated 4 days ago.

Status:
In Progress
Priority:
Urgent
Assignee:
Target version:
-
Start date:
10/09/2017
Due date:
% Done:

0%


Description

I've been receiving OBS build failure notifications like this for about 3 days, so I think since friday:

Visit
https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.0
4/x86_64

Package network:osmocom:nitb-split:nightly/osmo-mgw failed to build in xUbuntu_16.04/x86_64

Check out the package for editing:
  osc checkout network:osmocom:nitb-split:nightly osmo-mgw

Last lines of build log:
[  108s]
[  108s] Makefile:693: recipe for target 'check-local' failed
[  108s] make[5]: *** [check-local] Error 1
[  108s] make[5]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:546: recipe for target 'check-am' failed
[  108s] make[4]: *** [check-am] Error 2
[  108s] make[4]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:398: recipe for target 'check-recursive' failed
[  108s] make[3]: *** [check-recursive] Error 1
[  108s] make[3]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:506: recipe for target 'check-recursive' failed
[  108s] make[2]: *** [check-recursive] Error 1
[  108s] make[2]: Leaving directory '/usr/src/packages/BUILD'
[  108s] Makefile:797: recipe for target 'check' failed
[  108s] make[1]: *** [check] Error 2
[  108s] make[1]: Leaving directory '/usr/src/packages/BUILD'
[  108s] dh_auto_test: make -j1 check returned exit code 2
[  108s] debian/rules:17: recipe for target 'build' failed
[  108s] make: *** [build] Error 2
[  108s] dpkg-buildpackage: error: debian/rules build gave error exit status 2
[  108s]
[  108s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct  9 02:58:31 UTC 2017.
[  108s]
[  108s] ### VM INTERACTION START ###
[  111s] [  103.628587] reboot: Power down
[  111s] ### VM INTERACTION END ###
[  111s]
[  111s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct  9 02:58:34 UTC 2017.
[  111s]

The builds fail consistently on all architectures, so we must have introduced some regression there. I'm assigning it to neels, as he has committed/pushed the most recent changes that co-incide with the time of failure. Please investigate (feel free to reassign in the sysmocom team, if you think somebody else is better suited to fix it), thanks!

log.txt Magnifier (183 KB) lynxis, 10/11/2017 05:49 AM

History

#1 Updated by neels 11 days ago

  • Status changed from New to Feedback
  • Assignee changed from neels to lynxis

The decisive output is:

[  105s] Regression tests.
[  105s] 
[  105s]   1: legacy_mgcp                                     ok
[  105s]   2: legacy_mgcp-trans                               skipped (testsuite.at:12)
[  105s]   3: mgcp_client                                     ok
[  105s]   4: mgcp                                            FAILED (testsuite.at:27)

But I cannot reproduce the mgcp test failure.

The OBS builds (still) don't output the test logs and I am not sure how we can obtain them.
Something like the cat-testlogs.sh our jenkins builds are doing would be very helpful.
https://git.osmocom.org/osmo-ci/tree/scripts/cat-testlogs.sh

I would like to redirect to lynxis to probe whether we can get the test logs, and to dexter for an mgcp_test.c fix.

#2 Updated by lynxis 9 days ago

  • File log.txtMagnifier added
  • Assignee changed from lynxis to dexter

@neels: I've extended the log similiar to libosmocore [1]. I've manual deployed this patch to OBS. [1] or attached log.txt now shows the exact problem.

@dexter: can you take a look on the test output?

[1] https://gerrit.osmocom.org/#/c/4212/
[2] https://build.opensuse.org/build/network:osmocom:nitb-split:nightly/Debian_9.0/x86_64/osmo-mgw/_log

#3 Updated by lynxis 9 days ago

  • Status changed from Feedback to In Progress

#4 Updated by dexter 7 days ago

  • Assignee changed from dexter to lynxis

I am sorry, without further information such as log outputs I can not help much.

#5 Updated by neels 7 days ago

  • Assignee changed from lynxis to neels

dexter, there is a link right there with log outputs, you should have been able to investigate.

#6 Updated by neels 7 days ago

anyway, here is one again https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.10/x86_64
which was also posted on the chat. So it seems like you have other reasons to not look into this.

#7 Updated by neels 7 days ago

  • Assignee changed from neels to dexter

thx

#8 Updated by dexter 7 days ago

#9 Updated by dexter 7 days ago

  • Assignee changed from dexter to neels

It seems that it is a problem with the override of sendto(). I also wonder why
the legacy test does not fail since it is very similar to the mgcp test that
fails. If the following solution works out we might want to apply it there
as well.

In mgcp_test.c one can find an override of sendto:

#define MGCP_DUMMY_LOAD 0x23
static int dummy_packets = 0;
/* override and forward */
ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
        const struct sockaddr *dest_addr, socklen_t addrlen)
{
    typedef ssize_t (*sendto_t)(int, const void *, size_t, int,
            const struct sockaddr *, socklen_t);
    static sendto_t real_sendto = NULL;
    uint32_t dest_host = htonl(((struct sockaddr_in *)dest_addr)->sin_addr.s_addr);
    int      dest_port = htons(((struct sockaddr_in *)dest_addr)->sin_port);

    if (!real_sendto)
        real_sendto = dlsym(RTLD_NEXT, "sendto");

    if (len == 1 && ((const char *)buf)[0] == MGCP_DUMMY_LOAD ) {
        fprintf(stderr, "Dummy packet to 0x%08x:%d, msg length %zu\n%s\n\n",
               dest_host, dest_port,
               len, osmo_hexdump(buf, len));
        dummy_packets += 1;
    }

    return real_sendto(sockfd, buf, len, flags, dest_addr, addrlen);
}

I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.

The interesting part is that the override sendo is counting the transmitted
dummy packets.

After the counting the packets are forwared using the real_sendto. This is
why we see the packets in wireshark. Probably we want to remove this so that
no actual packets leave. At least when I comment the real_sendto out the
test still passes. So that would be a way to get rid of the packets that
OBS probably won't like.

Lets have a look at the log output:

[   80s] <0010> mgcp_network.c:136 endpoint:5 Failed to send dummy RTP packet.
[   80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:5 connection successfully created
[   80s] [0;m<0010> mgcp_protocol.c:522 CRCX: creating new connection ...
[   80s] [0;m<0010> mgcp_sdp.c:300 Got media info via SDP: port 5904, payload 97 (GSM-EFR), duration 40, addr 123.12.12.123
[   80s] [0;mPolicy CB got state 1 on endpoint 1
[   80s] Dummy packet to 0x7b0c0c7b:5904, msg length 1
[   80s] 23 
[   80s] 
[   80s] <0010> mgcp_network.c:136 endpoint:1 Failed to send dummy RTP packet.
[   80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:1 connection successfully created
[   80s] [0;m<0010> mgcp_osmux.c:682 All Osmux circuits are in use!
[   80s] [0;m--- expout    2017-10-13 13:15:05.388398199 +0000
[   80s] +++ /usr/src/packages/BUILD/tests/testsuite.dir/at-groups/4/stdout    2017-10-13 13:15:05.436398197 +0000
[   80s] @@ -16,12 +16,12 @@
[   80s]  Testing MDCX1
[   80s]  Testing MDCX2
[   80s]  Testing CRCX
[   80s] -Dummy packets: 2
[   80s] +Dummy packets: 1

We can see that the sendto() override is executed. Then we spot a message
from mgcp_network.c. This could mean that the dummy sendto returned with
an error code. Probably because the real_sendto returned an eror.

The dummy packet transmission terminates on this function:

int mgcp_send_dummy(struct mgcp_endpoint *endp, struct mgcp_conn_rtp *conn)
{
    static char buf[] = { MGCP_DUMMY_LOAD };
    int rc;
    int was_rtcp = 0;

    OSMO_ASSERT(endp);
    OSMO_ASSERT(conn);

    LOGP(DLMGCP, LOGL_DEBUG,
         "endpoint:%x sending dummy packet...\n", ENDPOINT_NUMBER(endp));
    LOGP(DLMGCP, LOGL_DEBUG, "endpoint:%x conn:%s\n",
         ENDPOINT_NUMBER(endp), mgcp_conn_dump(conn->conn));

    rc = mgcp_udp_send(conn->end.rtp.fd, &conn->end.addr,
               conn->end.rtp_port, buf, 1);

    if (rc == -1)
        goto failed;

    if (endp->tcfg->omit_rtcp)
        return rc;

    was_rtcp = 1;
    rc = mgcp_udp_send(conn->end.rtcp.fd, &conn->end.addr,
               conn->end.rtcp_port, buf, 1);

    if (rc >= 0)
        return rc;

failed:
    LOGP(DLMGCP, LOGL_ERROR,
         "endpoint:%x Failed to send dummy %s packet.\n",
         ENDPOINT_NUMBER(endp), was_rtcp ? "RTCP" : "RTP");

    return -1;
}

Here we can see that the mgcp_udp_send() should normally execute twice. This
matches the observed behaviour. We see the output from the sendto() override,
wich executes ok until the point where it outputs. We also count the
dummy_packets one up.

Probably the real_sendto at the end of the sendto() override fails. This
eventually leads into a bad return code of mgcp_udp_send() and we will send
to failed: This is why we see the error message immediately after. By this
we omit the second mgcp_udp_send(), which would normally give us the second
dummy dummy packet.

Since the test still passes when the real_sendto is removed we possibly may
try this. A patch is available for review: https://gerrit.osmocom.org/4260

#10 Updated by neels 7 days ago

  • Assignee changed from neels to dexter

dexter wrote:

I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.

The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.

The interesting part is that the override sendo is counting the transmitted
dummy packets. [...] we see the error message [...] would normally give us the second
dummy dummy packet.

Indeed, this should be the actual cause.
Let's try to clarify why the test was even sending out packets in the first place.
Ask zecke??

#11 Updated by dexter 4 days ago

I will drop the dummy packets completely as demanded in #2574. I wrote Holger an Email since we potentially have the same problem in legacy_mgcp as well.

#12 Updated by zecke 4 days ago

neels wrote:

The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.

We want to observe the side-effect (and we don't want to have a "network interface" indirection in the code).

Also available in: Atom PDF