Bug #2561
closedosmo-mgw OBS nightly builds failing for 3 consecutive days
100%
Description
I've been receiving OBS build failure notifications like this for about 3 days, so I think since friday:
Visit https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.0 4/x86_64 Package network:osmocom:nitb-split:nightly/osmo-mgw failed to build in xUbuntu_16.04/x86_64 Check out the package for editing: osc checkout network:osmocom:nitb-split:nightly osmo-mgw Last lines of build log: [ 108s] [ 108s] Makefile:693: recipe for target 'check-local' failed [ 108s] make[5]: *** [check-local] Error 1 [ 108s] make[5]: Leaving directory '/usr/src/packages/BUILD/tests' [ 108s] Makefile:546: recipe for target 'check-am' failed [ 108s] make[4]: *** [check-am] Error 2 [ 108s] make[4]: Leaving directory '/usr/src/packages/BUILD/tests' [ 108s] Makefile:398: recipe for target 'check-recursive' failed [ 108s] make[3]: *** [check-recursive] Error 1 [ 108s] make[3]: Leaving directory '/usr/src/packages/BUILD/tests' [ 108s] Makefile:506: recipe for target 'check-recursive' failed [ 108s] make[2]: *** [check-recursive] Error 1 [ 108s] make[2]: Leaving directory '/usr/src/packages/BUILD' [ 108s] Makefile:797: recipe for target 'check' failed [ 108s] make[1]: *** [check] Error 2 [ 108s] make[1]: Leaving directory '/usr/src/packages/BUILD' [ 108s] dh_auto_test: make -j1 check returned exit code 2 [ 108s] debian/rules:17: recipe for target 'build' failed [ 108s] make: *** [build] Error 2 [ 108s] dpkg-buildpackage: error: debian/rules build gave error exit status 2 [ 108s] [ 108s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct 9 02:58:31 UTC 2017. [ 108s] [ 108s] ### VM INTERACTION START ### [ 111s] [ 103.628587] reboot: Power down [ 111s] ### VM INTERACTION END ### [ 111s] [ 111s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct 9 02:58:34 UTC 2017. [ 111s]
The builds fail consistently on all architectures, so we must have introduced some regression there. I'm assigning it to neels, as he has committed/pushed the most recent changes that co-incide with the time of failure. Please investigate (feel free to reassign in the sysmocom team, if you think somebody else is better suited to fix it), thanks!
Files
Updated by neels over 6 years ago
- Status changed from New to Feedback
- Assignee changed from neels to lynxis
The decisive output is:
[ 105s] Regression tests. [ 105s] [ 105s] 1: legacy_mgcp ok [ 105s] 2: legacy_mgcp-trans skipped (testsuite.at:12) [ 105s] 3: mgcp_client ok [ 105s] 4: mgcp FAILED (testsuite.at:27)
But I cannot reproduce the mgcp test failure.
The OBS builds (still) don't output the test logs and I am not sure how we can obtain them.
Something like the cat-testlogs.sh our jenkins builds are doing would be very helpful.
https://git.osmocom.org/osmo-ci/tree/scripts/cat-testlogs.sh
I would like to redirect to lynxis to probe whether we can get the test logs, and to dexter for an mgcp_test.c fix.
Updated by lynxis over 6 years ago
neels: I've extended the log similiar to libosmocore [1]. I've manual deployed this patch to OBS. [1] or attached log.txt now shows the exact problem.
dexter: can you take a look on the test output?
[1] https://gerrit.osmocom.org/#/c/4212/
[2] https://build.opensuse.org/build/network:osmocom:nitb-split:nightly/Debian_9.0/x86_64/osmo-mgw/_log
Updated by dexter over 6 years ago
- Assignee changed from dexter to lynxis
I am sorry, without further information such as log outputs I can not help much.
Updated by neels over 6 years ago
- Assignee changed from lynxis to neels
dexter, there is a link right there with log outputs, you should have been able to investigate.
Updated by neels over 6 years ago
anyway, here is one again https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.10/x86_64
which was also posted on the chat. So it seems like you have other reasons to not look into this.
Updated by dexter over 6 years ago
- Assignee changed from dexter to neels
It seems that it is a problem with the override of sendto(). I also wonder why
the legacy test does not fail since it is very similar to the mgcp test that
fails. If the following solution works out we might want to apply it there
as well.
In mgcp_test.c one can find an override of sendto:
#define MGCP_DUMMY_LOAD 0x23 static int dummy_packets = 0; /* override and forward */ ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen) { typedef ssize_t (*sendto_t)(int, const void *, size_t, int, const struct sockaddr *, socklen_t); static sendto_t real_sendto = NULL; uint32_t dest_host = htonl(((struct sockaddr_in *)dest_addr)->sin_addr.s_addr); int dest_port = htons(((struct sockaddr_in *)dest_addr)->sin_port); if (!real_sendto) real_sendto = dlsym(RTLD_NEXT, "sendto"); if (len == 1 && ((const char *)buf)[0] == MGCP_DUMMY_LOAD ) { fprintf(stderr, "Dummy packet to 0x%08x:%d, msg length %zu\n%s\n\n", dest_host, dest_port, len, osmo_hexdump(buf, len)); dummy_packets += 1; } return real_sendto(sockfd, buf, len, flags, dest_addr, addrlen); }
I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.
The interesting part is that the override sendo is counting the transmitted
dummy packets.
After the counting the packets are forwared using the real_sendto. This is
why we see the packets in wireshark. Probably we want to remove this so that
no actual packets leave. At least when I comment the real_sendto out the
test still passes. So that would be a way to get rid of the packets that
OBS probably won't like.
Lets have a look at the log output:
[ 80s] <0010> mgcp_network.c:136 endpoint:5 Failed to send dummy RTP packet. [ 80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:5 connection successfully created [ 80s] [0;m<0010> mgcp_protocol.c:522 CRCX: creating new connection ... [ 80s] [0;m<0010> mgcp_sdp.c:300 Got media info via SDP: port 5904, payload 97 (GSM-EFR), duration 40, addr 123.12.12.123 [ 80s] [0;mPolicy CB got state 1 on endpoint 1 [ 80s] Dummy packet to 0x7b0c0c7b:5904, msg length 1 [ 80s] 23 [ 80s] [ 80s] <0010> mgcp_network.c:136 endpoint:1 Failed to send dummy RTP packet. [ 80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:1 connection successfully created [ 80s] [0;m<0010> mgcp_osmux.c:682 All Osmux circuits are in use! [ 80s] [0;m--- expout 2017-10-13 13:15:05.388398199 +0000 [ 80s] +++ /usr/src/packages/BUILD/tests/testsuite.dir/at-groups/4/stdout 2017-10-13 13:15:05.436398197 +0000 [ 80s] @@ -16,12 +16,12 @@ [ 80s] Testing MDCX1 [ 80s] Testing MDCX2 [ 80s] Testing CRCX [ 80s] -Dummy packets: 2 [ 80s] +Dummy packets: 1
We can see that the sendto() override is executed. Then we spot a message
from mgcp_network.c. This could mean that the dummy sendto returned with
an error code. Probably because the real_sendto returned an eror.
The dummy packet transmission terminates on this function:
int mgcp_send_dummy(struct mgcp_endpoint *endp, struct mgcp_conn_rtp *conn) { static char buf[] = { MGCP_DUMMY_LOAD }; int rc; int was_rtcp = 0; OSMO_ASSERT(endp); OSMO_ASSERT(conn); LOGP(DLMGCP, LOGL_DEBUG, "endpoint:%x sending dummy packet...\n", ENDPOINT_NUMBER(endp)); LOGP(DLMGCP, LOGL_DEBUG, "endpoint:%x conn:%s\n", ENDPOINT_NUMBER(endp), mgcp_conn_dump(conn->conn)); rc = mgcp_udp_send(conn->end.rtp.fd, &conn->end.addr, conn->end.rtp_port, buf, 1); if (rc == -1) goto failed; if (endp->tcfg->omit_rtcp) return rc; was_rtcp = 1; rc = mgcp_udp_send(conn->end.rtcp.fd, &conn->end.addr, conn->end.rtcp_port, buf, 1); if (rc >= 0) return rc; failed: LOGP(DLMGCP, LOGL_ERROR, "endpoint:%x Failed to send dummy %s packet.\n", ENDPOINT_NUMBER(endp), was_rtcp ? "RTCP" : "RTP"); return -1; }
Here we can see that the mgcp_udp_send() should normally execute twice. This
matches the observed behaviour. We see the output from the sendto() override,
wich executes ok until the point where it outputs. We also count the
dummy_packets one up.
Probably the real_sendto at the end of the sendto() override fails. This
eventually leads into a bad return code of mgcp_udp_send() and we will send
to failed: This is why we see the error message immediately after. By this
we omit the second mgcp_udp_send(), which would normally give us the second
dummy dummy packet.
Since the test still passes when the real_sendto is removed we possibly may
try this. A patch is available for review: https://gerrit.osmocom.org/4260
Updated by neels over 6 years ago
- Assignee changed from neels to dexter
dexter wrote:
I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.
The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.
The interesting part is that the override sendo is counting the transmitted
dummy packets. [...] we see the error message [...] would normally give us the second
dummy dummy packet.
Indeed, this should be the actual cause.
Let's try to clarify why the test was even sending out packets in the first place.
Ask zecke??
Updated by dexter over 6 years ago
I will drop the dummy packets completely as demanded in #2574. I wrote Holger an Email since we potentially have the same problem in legacy_mgcp as well.
Updated by zecke over 6 years ago
neels wrote:
The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.
We want to observe the side-effect (and we don't want to have a "network interface" indirection in the code).
Updated by dexter over 6 years ago
- Status changed from In Progress to Feedback
- Assignee changed from dexter to neels
neels Is the problem still present?
Updated by laforge over 6 years ago
Philipp,
On Mon, Oct 23, 2017 at 09:07:42PM +0000, dexter [REDMINE] wrote:
neels Is the problem still present?
would you please simply look at the OBS status/output yourself rathern
than assigning this rather mundane task to a colleague?
Updated by neels over 6 years ago
- Assignee changed from neels to zecke
- Priority changed from Urgent to Low
- % Done changed from 0 to 100
dexter wrote:
neels Is the problem still present?
Patch is merged, and osmo-mgw builds successfully, as can be seen at https://build.opensuse.org/project/show/network:osmocom:nitb-split:nightly
Do we have a conclusion yet on whether we should re-add sending of packets in the test?
Personally, I would keep them disabled, and refer to Harald's TTCN3 setup to test for network operation related leaks and side effects.
zecke, would that be ok from your viewpoint?
Updated by laforge about 6 years ago
- Status changed from Feedback to Resolved
- Assignee changed from zecke to neels
- % Done changed from 90 to 100
no feedback about re-enabling the sendign in the unit tests, let's close this.