Bug #2386

closed

libosmocoding build failure on OBS

Added by msuraev over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
07/21/2017
Due date:
% Done:

100%

Spec Reference:

Description

This failure appears and disappears from time to time. It seems to depend heavily on the SSE flags/features of the particular machine running the build, which is currently selected at random: https://build.opensuse.org/project/monitor/network:osmocom:nightly

An example of a failing host's SSE configuration:

osc workerinfo x86_64:build34:1 |grep sse
      <flag>sse</flag>
      <flag>sse2</flag>
      <flag>sse4a</flag>
      <flag>misalignsse</flag>

An example of a successful host's SSE configuration:

osc workerinfo x86_64:lamb10:1 |grep sse
      <flag>sse</flag>
      <flag>sse2</flag>
      <flag>ssse3</flag>
      <flag>sse4_1</flag>
      <flag>sse4_2</flag>
      <flag>sse4a</flag>
      <flag>misalignsse</flag>

I'm not sure which of the differences between those hosts is actually causing the "illegal instruction" error.

Right now builds are mostly OK because the failure-inducing build host configuration seems to be rather rare. Nevertheless, we should fix it by either:

I'm not sure which of those options is better/easier.


Files

libosmocore_ubuntu17.04_i586.log (541 KB), msuraev, 07/21/2017 03:43 PM
obs-libosmocore-fail-log.txt (1.43 MB), full log including /proc/cpuinfo, laforge, 11/17/2017 06:46 AM

Related issues

Has duplicate libosmocore - Bug #2645: libosmocodec fails to pass "make test" on some CPUs (Closed, laforge, 11/16/2017)

Actions #1

Updated by fixeria over 6 years ago

I think there is one possible reason for this failure:

  • Currently, conv_acc_sse.c is compiled with both -msse3 and -msse4.1 in CFLAGS. Some compilers (or compiler versions) may then use SSE4.1 instructions to optimize the whole object code, even in places where the SSE4.1 intrinsics are not actually used. So one possible solution is to split conv_acc_sse.c into conv_acc_sse_3.c and conv_acc_sse_41.c, and compile them with -msse3 and -msse4.1 respectively.

Although I have already tested the library with QEMU, which allows modifying the set of SIMD features exposed to the guest OS, I'll try to test again. Probably I missed something, or something changed after testing...
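
For illustration, the idea of confining each instruction set to its own code and picking an implementation at run time can be sketched as below. This is only a sketch under assumptions, not the libosmocore code: the names sign16_generic, sign16_ssse3 and pick_sign16 are hypothetical, and it relies on the GCC/clang extensions __attribute__((target(...))) and __builtin_cpu_supports() rather than on per-file CFLAGS. On non-x86 targets only the portable fallback is built.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <tmmintrin.h>  /* SSSE3 intrinsics such as _mm_sign_epi16 */
#define HAVE_X86_DISPATCH 1
#endif

typedef void (*sign16_fn)(int16_t *dst, const int16_t *a,
                          const int16_t *b, size_t n);

/* Portable fallback with the per-lane semantics of PSIGNW:
 * negate a[i] if b[i] < 0, zero it if b[i] == 0, else keep it. */
void sign16_generic(int16_t *dst, const int16_t *a,
                    const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (b[i] < 0)
            dst[i] = (int16_t)-a[i];
        else if (b[i] == 0)
            dst[i] = 0;
        else
            dst[i] = a[i];
    }
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is allowed to contain SSSE3 instructions;
 * the rest of the file is built for the baseline CPU. */
__attribute__((target("ssse3")))
void sign16_ssse3(int16_t *dst, const int16_t *a,
                  const int16_t *b, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_sign_epi16(va, vb));
    }
    /* Handle the remaining (fewer than 8) elements without SIMD. */
    sign16_generic(dst + i, a + i, b + i, n - i);
}
#endif

/* Choose an implementation based on what the running CPU reports. */
sign16_fn pick_sign16(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return sign16_ssse3;
#endif
    return sign16_generic;
}
```

A caller would fetch the function pointer once, e.g. at library init, and use it for all later calls. On a CPU without SSSE3 the fallback is selected, so no SSSE3 instruction is ever executed there.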

Actions #2

Updated by neels over 6 years ago

Saw a similar occurrence on the openSUSE build host 'cumulus3':

/usr/src/packages/BUILD/tests/testsuite.dir/at-groups/9/test-source: line 25: 26402 Illegal instruction $abs_top_builddir/tests/conv/conv_gsm0503_test

Actions #3

Updated by msuraev over 6 years ago

fixeria wrote:

separate the conv_acc_sse.c to conv_acc_sse_3.c and conv_acc_sse_41.c, and then compile them with -msse3 -msse4.1 flags respectively.

What would be the difference between conv_acc_sse_3.c and conv_acc_sse_41.c? Can we just compile conv_acc_sse.c twice - once with "-msse3" and once with "-msse4.1"?

Actions #4

Updated by msuraev over 6 years ago

Actions #5

Updated by laforge over 6 years ago

  • Assignee set to laforge
Actions #6

Updated by laforge over 6 years ago

  • Has duplicate Bug #2645: libosmocodec fails to pass "make test" on some CPUs added
Actions #7

Updated by laforge over 6 years ago

So far, the problem could be observed on build33, build35, build36, build31, build32

Actions #8

Updated by laforge over 6 years ago

The processor on a failing node is an AMD Opteron 2427:

[  243s] cat /proc/cpuinfo
[  243s] processor    : 0
[  243s] vendor_id    : AuthenticAMD
[  243s] cpu family    : 16
[  243s] model        : 8
[  243s] model name    : Six-Core AMD Opteron(tm) Processor 2427
[  243s] stepping    : 0
[  243s] microcode    : 0x1000065
[  243s] cpu MHz        : 2211.336
[  243s] cache size    : 512 KB
[  243s] physical id    : 0
[  243s] siblings    : 1
[  243s] core id        : 0
[  243s] cpu cores    : 1
[  243s] apicid        : 0
[  243s] initial apicid    : 0
[  243s] fpu        : yes
[  243s] fpu_exception    : yes
[  243s] cpuid level    : 6
[  243s] wp        : yes
[  243s] flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow rep_good nopl extd_apicid pni cx16 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw arat vmmcall
[  243s] bugs        : tlb_mmatch fxsave_leak sysret_ss_attrs
[  243s] bogomips    : 4422.67
[  243s] TLB size    : 1024 4K pages
[  243s] clflush size    : 64
[  243s] cache_alignment    : 64
[  243s] address sizes    : 42 bits physical, 48 bits virtual
[  243s] power management:

See the full log attached. This is a rather old processor, which only implements SSE, SSE2, SSE3 and SSE4a, but not real SSE4/4.1. See http://www.cpu-world.com/Glossary/S/SSE4a.html which might be related.

Actions #9

Updated by laforge over 6 years ago

More details on SSE4 / SSE4.1 / SSE4.2 and SSE4a are documented at https://en.wikipedia.org/wiki/SSE4

Most notably, SSE4.1 and SSE4.2 are not extensions of SSE4, but subsets of it, just as SSE4a is a small subset.

Actions #10

Updated by laforge over 6 years ago

I'm creating a Qemu VM with all build dependencies. One cannot use KVM unless the host CPU has the same CPU Flags (which Intel CPUs normally don't), so I'm using regular old-school qemu-system without kvm. The closest approximation to the "failing" CPU is a "-cpu Opteron_G3" flag of qemu.

Actions #11

Updated by laforge over 6 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30
Actions #12

Updated by laforge over 6 years ago

/proc/cpuinfo on the qemu simulated Opteron_G3:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : AMD Opteron 23xx (Gen 3 Class Opteron)
stepping        : 3
cpu MHz         : 2591.915
cache size      : 512 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl extd_apicid eagerfpu pni monitor cx16 popcnt hypervisor lahf_lm svm abm sse4a 3dnowprefetch vmmcall
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs amd_e400
bogomips        : 5183.83
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
Actions #13

Updated by laforge over 6 years ago

  • % Done changed from 30 to 50

The bug can be reproduced in qemu:

Starting program: /root/git/libosmocore/tests/conv/.libs/conv_test 
[+] Testing: GSM xCCH (non-recursive, flushed, not punctured)
[.] Input length  : ret = 224  exp = 224 -> OK
[.] Output length : ret = 456  exp = 456 -> OK
[.] Pre computed vector checks:
[..] Encoding: OK
Program received signal SIGILL, Illegal instruction.
_sse_metrics_k5_n2 (norm=1, paths=0x55555575e420, sums=0x55555575dc50, out=0x55555575dc80, val=<synthetic pointer>) at ./conv_acc_sse_impl.h:280
280             m1 = _mm_sign_epi16(m2, m1);
(gdb) bt
#0  _sse_metrics_k5_n2 (norm=1, paths=0x55555575e420, sums=0x55555575dc50, out=0x55555575dc80, val=<synthetic pointer>) at ./conv_acc_sse_impl.h:280
#1  osmo_conv_sse_metrics_k5_n2 (val=<optimized out>, out=0x55555575dc80, sums=0x55555575dc50, paths=0x55555575e420, norm=1) at conv_acc_sse.c:86
#2  0x00007ffff7bca244 in forward_traverse (dec=dec@entry=0x7fffffffd9a0, 
    seq=seq@entry=0x55555575d030 "\201\201\201\177\201\177\177\201\177\201\177\177\201\201\177\201\177\201\201\201\177\177\177\177\201\177\201\177\201\177\201\201\201\177\201\177\177\177\201\177\201\177\177\177\177\201\201\201\201\201\201\201\177\177\177\177\201\201\201\177\177\201\201\201\177\177\177\177\177\201\177\177\177\177\177\201\177\201\177\177\177\201\201\201\201\201\177\177\201\177\201\177\201\177\201\201\201\177\201\177\201\201\201\201\177\201\201\177\201\177\201\201\201\177\201\177\177\177\177\201\177\177\177\201\177\201\201\177\201\201\201\177\201\177\201\201\177\177\201\201", '\177' <repeats 12 times>, "\201\201\177\201\201\201\201\177\201\201\177\177\201\177\177\177\201\201\201\201\201\201\177\201\177\177\177\177\201\177\201\201\201\177\177\177\177\201\177\201\201\177\177\177\177\177\177\177"...) at conv_acc.c:615
#3  0x00007ffff7bca952 in conv_decode (term=0, len=224, out=0x55555575c820 "\001\001\001", punc=<optimized out>, 
    seq=0x55555575d030 "\201\201\201\177\201\177\177\201\177\201\177\177\201\201\177\201\177\201\201\201\177\177\177\177\201\177\201\177\201\177\201\201\201\177\201\177\177\177\201\177\201\177\177\177\177\201\201\201\201\201\201\201\177\177\177\177\201\201\201\177\177\201\201\201\177\177\177\177\177\201\177\177\177\177\177\201\177\201\177\177\177\201\201\201\201\201\177\177\201\177\201\177\201\177\201\201\201\177\201\177\201\201\201\201\177\201\201\177\201\177\201\201\201\177\201\177\177\177\177\201\177\177\177\201\177\201\201\177\201\201\201\177\201\177\201\201\177\177\201\201", '\177' <repeats 12 times>, "\201\201\177\201\201\201\201\177\201\201\177\177\201\177\177\177\201\201\201\201\201\201\177\201\177\177\177\177\201\177\201\201\201\177\177\177\177\201\177\201\201\177\177\177\177\177\177\177"..., dec=0x7fffffffd9a0)
    at conv_acc.c:639
#4  osmo_conv_decode_acc (code=0x55555575ad80 <gsm0503_xcch>, input=<optimized out>, output=0x55555575c820 "\001\001\001") at conv_acc.c:707
#5  0x00007ffff7bc7a15 in osmo_conv_decode (code=0x55555575ad80 <gsm0503_xcch>, 
    input=input@entry=0x55555575d030 "\201\201\201\177\201\177\177\201\177\201\177\177\201\201\177\201\177\201\201\201\177\177\177\177\201\177\201\177\201\177\201\201\201\177\201\177\177\177\201\177\201\177\177\177\177\201\201\201\201\201\201\201\177\177\177\177\201\201\201\177\177\201\201\201\177\177\177\177\177\201\177\177\177\177\177\201\177\201\177\177\177\201\201\201\201\201\177\177\201\177\201\177\201\177\201\201\201\177\201\177\201\201\201\201\177\201\201\177\201\177\201\201\201\177\201\177\177\177\177\201\177\177\177\201\177\201\201\177\201\201\201\177\201\177\201\201\177\177\201\201", '\177' <repeats 12 times>, "\201\201\177\201\201\201\201\177\201\201\177\177\201\177\177\177\201\201\201\201\201\201\177\201\177\177\177\177\201\177\201\201\201\177\177\177\177\201\177\201\201\177\177\177\177\177\177\177"..., 
    output=output@entry=0x55555575c820 "\001\001\001") at conv.c:615
#6  0x000055555555646b in do_check (test=0x7fffffffdb20) at conv/conv.c:84
#7  0x000055555555619d in main (argc=<optimized out>, argv=<optimized out>) at conv/conv_test.c:275

The illegal instruction is:

0x7ffff7bcb85b <osmo_conv_sse_metrics_k5_n2+75>         psignw 0x10(%rsi),%xmm0

PSIGNW is an SSSE3 instruction, and the CPU advertises only SSE, SSE2, SSE3 (the "pni" flag) and SSE4a, but not SSSE3.
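
The flag check itself is easy to get wrong by eye or with a plain substring search, because "sse" is a substring of "ssse3" and "sse4a". A minimal sketch of a whole-token check on a /proc/cpuinfo style flags string follows; has_cpu_flag is a hypothetical helper written for this note, not an existing API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `name` occurs as a complete whitespace-separated
 * token in `flags` (the text after "flags :" in /proc/cpuinfo). */
bool has_cpu_flag(const char *flags, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags;

    if (len == 0)
        return false;

    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning or after a separator... */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and must end at the end of the string or before a separator. */
        bool ends = p[len] == '\0' || p[len] == ' ' ||
                    p[len] == '\t' || p[len] == '\n';

        if (starts && ends)
            return true;
        p++;  /* keep searching after this partial match */
    }
    return false;
}
```

Applied to the Opteron_G3 flags line quoted above, such a check finds "sse2", "pni" and "sse4a" but neither "ssse3" nor "sse4_1", which matches the SIGILL on PSIGNW.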

Actions #14

Updated by laforge over 6 years ago

  • % Done changed from 50 to 80
Actions #15

Updated by laforge over 6 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100
Actions #16

Updated by laforge over 6 years ago
