Bug #1869

closed

osmo-trx binary cannot be moved to similar CPU

Added by neels over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 12/05/2016
Due date:
% Done: 100%
Spec Reference:

Description

Built osmo-trx on a jenkins build slave and installed the resulting binary
on the gsm-tester APU. Both are amd64 CPUs, but osmo-trx on the APU fails
with SIGILL (Illegal Instruction).

The CPUs have slightly differing CPU feature sets, which seems to be the cause of this problem.

Binaries should be interoperable within the same CPU family,
using run-time checks to determine which CPU features to use.

build slave CPU info:

# uname -a
Linux sysmocom-office1.am93.sysmocom.de 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 63
model name    : Intel(R) Xeon(R) CPU E5-1660 v3 @ 3.00GHz
stepping    : 2
microcode    : 0x2e
cpu MHz        : 1200.000
cache size    : 20480 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 8
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 15
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips    : 5999.38
clflush size    : 64
cache_alignment    : 64
address sizes    : 46 bits physical, 48 bits virtual
power management:
[...]

APU CPU info:

# uname -a
Linux apu-roh 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

# cat /proc/cpuinfo 
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 20
model        : 2
model name    : AMD G-T40E Processor
stepping    : 0
microcode    : 0x5000101
cpu MHz        : 800.000
cache size    : 512 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 6
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
bogomips    : 2000.17
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
[...]

Here is a diff of the CPU features, -buildslave +APU:

+3dnowprefetch
 abm
-acpi
-aes
 aperfmperf
 apic
 arat
-arch_perfmon
-avx
-avx2
-bmi1
-bmi2
-bts
 clflush
 cmov
+cmp_legacy
 constant_tsc
+cr8_legacy
 cx16
 cx8
-dca
 de
-ds_cpl
-dtes64
-dtherm
-dts
-eagerfpu
-epb
-ept
-erms
-est
-f16c
-flexpriority
-fma
+extapic
+extd_apicid
 fpu
-fsgsbase
 fxsr
+fxsr_opt
 ht
-ida
-invpcid
+hw_pstate
+ibs
 lahf_lm
+lbrv
 lm
 mca
 mce
+misalignsse
 mmx
+mmxext
 monitor
-movbe
 msr
 mtrr
 nonstop_tsc
 nopl
+npt
+nrip_save
 nx
 pae
 pat
-pbe
-pcid
-pclmulqdq
-pdcm
+pausefilter
 pdpe1gb
-pebs
 pge
-pln
 pni
 popcnt
 pse
 pse36
-pts
-rdrand
 rdtscp
 rep_good
 sep
-smep
-smx
-ss
+skinit
 sse
 sse2
-sse4_1
-sse4_2
+sse4a
 ssse3
+svm
+svm_lock
 syscall
-tm
-tm2
-tpr_shadow
 tsc
-tsc_adjust
-tsc_deadline_timer
 vme
-vmx
-vnmi
-vpid
-x2apic
-xsave
-xsaveopt
-xtopology
-xtpr
+vmmcall
+wdt


Files

core.tgz (5.82 MB), neels, 04/25/2017 11:31 AM

Related issues

Related to Cellular Network Infrastructure - Bug #1928: nightly packages: osmo-trx fails for missing sqlite3.h and/or debian/rules error (Closed, 01/26/2017)

Blocks OsmoBTS - Feature #1849: osmo-bts-trx integration to osmo-gsm-tester (Closed, roh, 11/18/2016)

Actions #1

Updated by neels over 7 years ago

  • Blocks Feature #1849: osmo-bts-trx integration to osmo-gsm-tester added
Actions #2

Updated by neels over 7 years ago

copying some info from #1849, for the record:

root@apu-roh:/var/tmp/osmo-gsm-tester/tmp.DE0ZLpRaLs/20161205102906-StandardTestScenario-osmoTrxBTS/tmp.DE0ZLpRaLs-osmo-bts-trx# LD_LIBRARY_PATH="$PWD/lib" gdb bin/osmo-trx osmo-trx-core-26021
[...]
Core was generated by `/var/tmp/osmo-gsm-tester/tmp.DE0ZLpRaLs/20161205102906-StandardTestScenario-osm'.
Program terminated with signal SIGILL, Illegal instruction.
#0  _mm_shuffle_ps (__mask=34, __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/4.9/include/xmmintrin.h:743
743      return (__m128) __builtin_ia32_shufps ((__v4sf)__A, (__v4sf)__B, __mask);
(gdb) bt
#0  _mm_shuffle_ps (__mask=34, __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/4.9/include/xmmintrin.h:743
#1  sse_conv_real4 (x=0xdb4e58, h=0xd5bfd0, y=0xd95020, len=41) at convolve.c:60
#2  0x0000000000446434 in convolve_real (x=0xdb4e70, x_len=41, h=0xd5bfd0, h_len=4, y=0xd95020, y_len=41, start=0, len=41, step=1, offset=0) at convolve.c:564
#3  0x000000000043dc10 in convolve (x=x@entry=0xd83500, h=h@entry=0xd8e9c0, y=0xd5b170, y@entry=0x0, spanType=spanType@entry=START_ONLY, start=start@entry=0, len=41, len@entry=0, step=1, offset=0)
    at sigProcLib.cpp:476
#4  0x000000000043edf9 in modulateBurstBasic (sps=1, guard_len=<optimized out>, bits=...) at sigProcLib.cpp:1180
#5  modulateBurst (wBurst=..., guardPeriodLength=<optimized out>, sps=1, emptyPulse=<optimized out>) at sigProcLib.cpp:1196
#6  0x000000000044241b in generateRACHSequence (sps=<optimized out>) at sigProcLib.cpp:1629
#7  sigProcLibSetup () at sigProcLib.cpp:2128
#8  0x000000000041ca04 in Transceiver::init (this=this@entry=0xd95b60, filler=1, rtsc=0, rach_delay=0, edge=<optimized out>) at Transceiver.cpp:181
#9  0x000000000040f4db in makeTransceiver (config=config@entry=0x7ffdf146aa10, radio=radio@entry=0xd82970) at osmo-trx.cpp:287
#10 0x000000000040b991 in main (argc=<optimized out>, argv=<optimized out>) at osmo-trx.cpp:540
(gdb) 
root@apu-roh:~# osmo-trx
linux; GNU C++ version 4.9.2; Boost_105500; UHD_003.009.005-0-unknown

opening configuration table from path :memory:
Config Settings
   Log Level............... NOTICE
   Device args............. 
   TRX Base Port........... 5700
   TRX Address............. 127.0.0.1
   Channels................ 1
   Tx Samples-per-Symbol... 4
   Rx Samples-per-Symbol... 1
   EDGE support............ Disabled
   Reference............... Internal
   C0 Filler Table......... Disabled
   Multi-Carrier........... Disabled
   Diversity............... Disabled
   Tuning offset........... 0
   RSSI to dBm offset...... 0
   Swap channels........... 0

-- Detected Device: B210
-- Operating over USB 2.
-- Initialize CODEC control...
-- Initialize Radio control...
-- Performing register loopback test... pass
-- Performing register loopback test... pass
-- Performing CODEC loopback test... pass
-- Performing CODEC loopback test... pass
-- Asking for clock rate 16.000000 MHz... 
-- Actually got clock rate 16.000000 MHz.
-- Performing timer loopback test... pass
-- Performing timer loopback test... pass
-- Setting master clock rate selection to 'automatic'.
-- Asking for clock rate 26.000000 MHz... 
-- Actually got clock rate 26.000000 MHz.
-- Performing timer loopback test... pass
-- Performing timer loopback test... pass
-- Setting B210 4/1 Tx/Rx SPS
Illegal instruction
Actions #3

Updated by zecke over 7 years ago

The "ifunc" function attribute (https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Common-Function-Attributes.html#Common-Function-Attributes) seems to be the right way forward. ifunc used to have issues on ARM (and MIPS?), but I think they were resolved years ago.

The idea is that you provide a dispatch function with "ifunc", and it returns the function to use from then on, e.g. it checks whether SSE is present and returns the matching implementation. For unit testing, one would need to be able to test all versions of it. An example is here: http://pasky.or.cz/dev/glibc/ifunc.c
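
To make the ifunc idea concrete, here is a minimal sketch in C. It assumes GCC (or Clang) on x86-64 Linux with glibc. The names dot, dot_generic, dot_sse41 and resolve_dot are invented for illustration and do not exist in osmo-trx. The resolver runs once when the dynamic linker binds dot(), and later calls go directly to the implementation it returned. On any other platform the sketch simply falls back to the plain C version.

```c
#include <stdlib.h>   /* size_t; on glibc this also defines __GLIBC__ */

/* Hypothetical kernel: portable C version, safe on every CPU. */
static float dot_generic(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

#if defined(__x86_64__) && defined(__GNUC__) && defined(__GLIBC__)

/* Same C body, but the compiler may emit SSE4.1 instructions for
 * this one function only. */
__attribute__((target("sse4.1")))
static float dot_sse41(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Resolver: invoked by the dynamic linker, returns the function
 * that dot() will refer to from then on. */
static dot_fn resolve_dot(void)
{
    /* Needed here because resolvers run before constructors. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") ? dot_sse41 : dot_generic;
}

float dot(const float *a, const float *b, size_t n)
    __attribute__((ifunc("resolve_dot")));

#else

/* No ifunc support assumed: always use the portable version. */
float dot(const float *a, const float *b, size_t n)
{
    return dot_generic(a, b, n);
}

#endif
```

Note that the ifunc attribute by itself does not change how any function is compiled. Only the target attribute on dot_sse41 does that, comparable to passing -msse4.1 for that single function.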

Actions #4

Updated by neels over 7 years ago

A quick workaround could be:
./configure --without-sse

The osmo-trx build has a number of SSE levels it checks for, and also two more CPU features: MMX and AVX.
This obviously makes no sense when building for another CPU, either.

checking whether mmx is supported... yes
checking whether sse is supported... yes
checking whether sse2 is supported... yes
checking whether sse3 is supported... yes
checking whether ssse3 is supported... yes
checking whether sse4.1 is supported... yes
checking whether sse4.2 is supported... yes
checking whether avx is supported by processor... yes
checking for x86-AVX xgetbv 0x00000000 output... 7:0
checking whether avx is supported by operating system... yes
checking whether C compiler accepts -mmmx... yes
checking whether C compiler accepts -msse... yes
checking whether C compiler accepts -msse2... yes
checking whether C compiler accepts -msse3... yes
checking whether C compiler accepts -mssse3... yes
checking whether C compiler accepts -msse4.1... yes
checking whether C compiler accepts -msse4.2... yes
checking whether C compiler accepts -mavx... yes

ifunc looks like a nifty way forward, but it's hard to grok.
At first the example there looks like twice the same function... is the
attribute there like passing an -msse option for just one function?

So we need those for all six levels of SSE as well as MMX and AVX?

Actually, it seems that only SSE is used, no HAVE_MMX nor HAVE_AVX exist in any src files.
So far I fail to see where the MMX, SSE and AVX checks actually come from in osmo-trx/configure.ac.

For the time being I will try the --without-sse workaround for the gsm-tester,
awaiting an ifunc patch from some future bold contender.

Actions #5

Updated by laforge over 7 years ago

On Tue, Dec 06, 2016 at 12:18:45AM +0000, neels [REDMINE] wrote:

The osmo-trx build has a number of SSE levels it checks for, and also
two more CPU features: MMX and AVX. This obviously makes no sense
when building for another CPU, either.

I think this approach is completely broken. The standard method for
these kinds of optimizations is to check for them at runtime. This is
what every media player has probably been doing for a decade already.
This is also what (IIRC) the lower-layer gnuradio libraries for signal
processing (like fftw3) are doing.

Holger has given some indication how this can be done.

Now the bigger question is, who of the people involved with osmo-trx is
interested in fixing this?

Meanwhile, the optimizations will all have to be disabled for the binary
packages, and osmo-trx should print a big fat warning every time it is
started about the fact that none of the optimizations are used in the
binary packages, possibly even pointing to this bug.

Regards,
Harald
--
- Harald Welte           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)

Actions #6

Updated by laforge about 7 years ago

  • Priority changed from Normal to High
Actions #7

Updated by zecke about 7 years ago

  • Related to Bug #1928: nightly packages: osmo-trx fails for missing sqlite3.h and/or debian/rules error added
Actions #8

Updated by zecke about 7 years ago

So basically this should be a runtime check, and these days the "ifunc" approach works. On the first call, the linker calls a function that returns the real function to call. In there we can use the cpuid check to return an optimized one or not. With "newer" (I would guess gcc 4) gcc versions we can also enable per-function CPU tuning.

This means you could have three copies of the C code, optimize them with different compilation flags (e.g. enable auto vectorization for SSE4 or such), and then pick the right version at runtime.

Actions #9

Updated by laforge about 7 years ago

  • Assignee set to dexter
Actions #10

Updated by laforge about 7 years ago

  • Priority changed from High to Normal
Actions #11

Updated by dexter about 7 years ago

  • Status changed from New to In Progress

In terms of architecture-dependent code, we have to watch out for
the following ifdef/define constants:

  • HAVE_MMX
    (unused?)
  • HAVE_SSE
    (unused?)
  • HAVE_SSE2
    (unused?)
  • HAVE_SSE3
    ./Transceiver52M/x86/convolve.c
    ./Transceiver52M/x86/convert.c
  • HAVE_SSSE3
    (unused?)
  • HAVE_SSE4_1
    ./Transceiver52M/x86/convolve.c
    ./Transceiver52M/x86/convert.c
  • HAVE_SSE4_2
    (unused?)
  • HAVE_AVX
    (unused?)
Actions #12

Updated by dexter about 7 years ago

To check the support level of SSE, we can simply do:

#include <stdio.h>

/* See also:
 * https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/X86-Built-in-Functions.html */

int main()
{
    if (__builtin_cpu_supports("sse4.1")) {
        printf("SSE4.1 is supported!\n");
    } else
        printf("SSE4.1 is NOT supported!\n");

    if (__builtin_cpu_supports("sse3")) {
        printf("SSE3 is supported!\n");
    } else
        printf("SSE3 is NOT supported!\n");

    return 0;
}
Actions #13

Updated by dexter about 7 years ago

  • Status changed from In Progress to Feedback

Integrating this seemed easy at first glance. After examining the part that uses SSE, we (neels and I) even found that we can skip having function pointers. An extra if would add only minimal extra cost.

  • The detection of the SSE level works perfectly fine, both on my machine and on the APU.
  • Trouble starts when compiling the code on the APU. That should work, but it complains that the platform does not support the SSE features, even when the relevant compiler switch (-msse4.1) is set.
  • Also, and that worries me most: if we turn on SSE4.1 support, the compiler will apply this to all sources. It will probably introduce SSE4.1-based optimizations in other parts of the code and render the result incompatible.

In my opinion, the only way to do this properly is to compile all common parts without SSE, or at least with the lowest SSE level that is supported by all x64 CPUs. The specific parts would then be split into SSE3, SSE4.1 and NO-SSE and compiled separately with the correct SSE level. The CPU runtime detection then always selects the right code by either assigning pointers or having an extra if.

I hope I am wrong here, because splitting it all up will introduce code duplication (how to avoid that?), not to mention the difficulties with automake.

Actions #14

Updated by laforge about 7 years ago

How do the common users of such instruction set specific extensions
solve this? did you check e.g. volk, fftw, ffmpeg, ...?

Actions #15

Updated by laforge about 7 years ago

Hi Dexter,

On Mon, Mar 13, 2017 at 11:45:00AM +0000, dexter [REDMINE] wrote:

Integrating this seemed to be easy on the first look. After examining
the part that uses SSE we (neels and me) even found that we can skip
having function pointers. An extra If would only add very minimal
extra cost.

Where is the problem with function pointers? All the code doing
codec selection that I've seen (e.g. in ffmpeg libavcodec) does it
this way, hence my normal approach would be to do it like the others
do it, who have been doing runtime selection of cpu optimizations for
decades.

For example, see ffmpeg/libavcodec/x86/vp8dsp_init.c

They basically first assign the function pointer to the most generic
implementation, and then check for specific cpu features (in order of
"recentness" such as SSE level) and overwrite those function pointers.
That's code that runs once at startup, and which is fairly simple.
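
The pattern described here can be sketched in a few lines of C. This is only an illustration under assumptions: the struct dsp_ops, the scale_* functions and dsp_init() are hypothetical names, not code from ffmpeg or osmo-trx, and the feature checks use GCC's __builtin_cpu_supports() on x86. The pointer starts at the generic implementation and is overwritten in order of increasing SSE level, once, at startup.

```c
#include <stddef.h>

/* Hypothetical dispatch table holding the selected implementation. */
struct dsp_ops {
    void (*scale)(float *buf, size_t n, float k);
    const char *name;
};

/* Generic C fallback, safe on every CPU. */
static void scale_generic(float *buf, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}

#if defined(__x86_64__) || defined(__i386__)
/* Variants the compiler may optimize for a specific SSE level. */
__attribute__((target("sse3")))
static void scale_sse3(float *buf, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}

__attribute__((target("sse4.1")))
static void scale_sse41(float *buf, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}
#endif

static struct dsp_ops ops;

/* Runs once at startup: generic first, then overwrite by "recentness". */
void dsp_init(void)
{
    ops.scale = scale_generic;
    ops.name = "generic";
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse3")) {
        ops.scale = scale_sse3;
        ops.name = "sse3";
    }
    if (__builtin_cpu_supports("sse4.1")) {
        ops.scale = scale_sse41;
        ops.name = "sse4.1";
    }
#endif
}

/* Callers never see which variant was picked. */
void dsp_scale(float *buf, size_t n, float k)
{
    ops.scale(buf, n, k);
}

const char *dsp_impl_name(void)
{
    return ops.name;
}
```

Printing dsp_impl_name() at startup would also give the kind of "which optimization is in use" message asked for earlier in this ticket.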

  • The detection of the SSE level works perfectly fine. On my machine
    and on the APU - no problem here.

good.

  • Trouble starts when compiling the code on the APU. That should work,
    but it complains that the platform does not support the SSE features,
    even when the relevant compiler switch (-msse4.1) is set.

What is the exact error message[s]? Without those, it's hard to search
for any references and try to help you.

  • Also, and that worries me most, is that if we turn on SSE4.1
    support, the compiler will apply this to all sources, it will probably
    infer SSE4.1 based optimizations in other parts of the code and render
    the result incompatible.

What about not compiling all of the code with that flag, but only the
'codec' part (convolve.c, convert.c, ...)?

In my opinion, the only way to do this properly is to compile all
common parts without SSE, or at least with the lowest SSE level that
is supported by all x64 CPUs.

sure.

The specific parts would then be split into SSE3, SSE4.1 and NO-SSE
and compiled separately with the correct SSE level. The CPU runtime
detection then always selects the right code by either assigning
pointers or having an extra if.

exactly.

I hope I am wrong here, because splitting it all up will introduce
code duplication (how to avoid that?), not to mention the difficulties
with automake.

Where is the code duplication? You have code that is currently
#ifdef'ed in/out based on compile-time detection of SSE support. All
you're doing is splitting this code up in a way that you compile all of
it in all cases, even on "older" CPUs on the build host, and then select
the specific implementation at runtime.

And which difficulties with automake are you referring to? Automake is
not really involved anymore if you compile the code the same way on all
build hosts and only decide at runtime?

Actions #16

Updated by laforge about 7 years ago

Hi Dexter,

On Mon, Mar 13, 2017 at 11:45:00AM +0000, dexter [REDMINE] wrote:

  • Trouble starts when compiling the code on the APU. That should work,
    but it complains that the platform does not support the SSE features,
    even when the relevant compiler switch (-msse4.1) is set.

This sounds like you're using "-march=native" or "-mtune=native" which
only makes sense if you want to build code on the actual CPU model on
which you will later run it.

Actions #17

Updated by dexter about 7 years ago

  • % Done changed from 0 to 70

The matching implementation is now selected dynamically at runtime. In order to be sure that the convolution part did not break, I wrote a small test program to compute some test vectors. I compared the results before and after my changes, and they match. I made only minimal changes to the conversion code, so I skipped testing that.

The SSE-relevant sources are compiled with $(SIMD_FLAGS) now. The sources only contain the SSE implementation and the decision logic to defer to the correct implementation at runtime. That should run fine on non-SSE3 / non-SSE4.1 CPUs, since the decision logic is not vectorizable. However, we can divide this up further as discussed.

https://gerrit.osmocom.org/2098 buildenv: Turn off native architecture builds
https://gerrit.osmocom.org/2099 cosmetic: Make parameter lists uniform
https://gerrit.osmocom.org/2100 ssedetect: Add runtime CPU detection
https://gerrit.osmocom.org/2101 cosmetic: Add info about SSE support
https://gerrit.osmocom.org/2102 buildenv: Make build CPU invariant
https://gerrit.osmocom.org/2103 cosmetic: remove code duplication
https://gerrit.osmocom.org/2104 Add test program to verify convolution implementation
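
A before/after comparison of the kind described in this comment could look roughly like the sketch below. It is not the test program from the gerrit change. It is a simplified, real-valued stand-in with invented names (conv_ref, conv_candidate, compare_vectors, run_convolve_check), and the lengths 41 and 4 are merely borrowed from the backtrace earlier in this ticket. The point is only the structure: compute the same test vector with a reference and with the implementation under test, then count the samples that differ by more than a tolerance.

```c
#include <stddef.h>

/* Reference: y[i] = sum over k of h[k] * x[i - k], with x[j] = 0 for j < 0. */
static void conv_ref(const float *x, size_t x_len,
                     const float *h, size_t h_len, float *y)
{
    for (size_t i = 0; i < x_len; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < h_len && k <= i; k++)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}

/* Stand-in for the implementation under test: same math, other loop order. */
static void conv_candidate(const float *x, size_t x_len,
                           const float *h, size_t h_len, float *y)
{
    for (size_t i = 0; i < x_len; i++)
        y[i] = 0.0f;
    for (size_t k = 0; k < h_len; k++)
        for (size_t i = k; i < x_len; i++)
            y[i] += h[k] * x[i - k];
}

/* Returns how many samples differ by more than tol. */
size_t compare_vectors(const float *a, const float *b, size_t n, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > tol)
            bad++;
    }
    return bad;
}

/* Runs both versions on a deterministic test vector; 0 means they agree. */
size_t run_convolve_check(void)
{
    enum { X_LEN = 41, H_LEN = 4 };
    float x[X_LEN], y_ref[X_LEN], y_new[X_LEN];
    const float h[H_LEN] = { 0.25f, 0.5f, 0.5f, 0.25f };

    for (size_t i = 0; i < X_LEN; i++)
        x[i] = (float)((i * 7) % 11) - 5.0f;

    conv_ref(x, X_LEN, h, H_LEN, y_ref);
    conv_candidate(x, X_LEN, h, H_LEN, y_new);
    return compare_vectors(y_ref, y_new, X_LEN, 1e-5f);
}
```

In a real check, conv_candidate would be replaced by the SSE3 or SSE4.1 code path, and the tolerance would have to allow for differences in floating-point summation order.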

Actions #18

Updated by laforge about 7 years ago

I think using the following construct it should be possible to have
file-specific compiler flags and separate the sse3 and sse4.1 specific
code from the generic code, while compiling the generic code with
generic compiler optimization flags and the sse code with their
respective optimization.

The general idea is that you build one library for each of the
optimizations, as automake can have different CFLAGS for different
libraries (see
https://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html)
and then link those libraries (if present) into the 'main' library.

See the (untested) example snippet below:

----
if !ARCH_ARM
# Generic flags only (no -march=native), so the common code stays portable
AM_CFLAGS = -Wall -std=gnu99 -I${srcdir}/../common

noinst_LTLIBRARIES = libarch.la
libarch_la_LIBADD =

# SSE3-specific code, built as its own convenience library
if HAVE_SSE3
noinst_LTLIBRARIES += libarch_sse3.la
libarch_sse3_la_SOURCES = convert_convolve_sse3.c
libarch_sse3_la_CFLAGS = $(AM_CFLAGS) -msse3
libarch_la_LIBADD += libarch_sse3.la
endif

# SSE4.1-specific code, built as its own convenience library
if HAVE_SSE41
noinst_LTLIBRARIES += libarch_sse41.la
libarch_sse41_la_SOURCES = convert_convolve_sse41.c
libarch_sse41_la_CFLAGS = $(AM_CFLAGS) -msse4.1
libarch_la_LIBADD += libarch_sse41.la
endif

libarch_la_SOURCES = \
	../common/convolve_base.c \
	convert.c \
	convolve.c
endif
----


Actions #19

Updated by zecke about 7 years ago

On 16 Mar 2017, at 19:45, laforge [REDMINE] <> wrote:

Issue #1869 has been updated by laforge.

I think using the following construct it should be possible to have
file-specific compiler flags and separate the sse3 and sse4.1 specific
code from the generic code, while compiling the generic code with
generic compiler optimization flags and the sse code with their
respective optimization.

"newer" (I think already gcc 4.x) have two more options:

Actions #20

Updated by laforge about 7 years ago

On Thu, Mar 16, 2017 at 08:48:57PM +0000, zecke [REDMINE] wrote:

"newer" (I think already gcc 4.x) have two more options:

indeed, the following looks promising, and was introduced in gcc-4.4:

====
target (options)
Multiple target back ends implement the target attribute to specify that
a function is to be compiled with different target options than
specified on the command line. This can be used for instance to have
functions compiled with a different ISA (instruction set architecture)
than the default. You can also use the ‘#pragma GCC target’ pragma to
set more than one function to be compiled with specific target options.
See Function Specific Option Pragmas, for details about the ‘#pragma GCC
target’ pragma.

For instance, on an x86, you could declare one function with the
target("sse4.1,arch=core2") attribute and another with
target("sse4a,arch=amdfam10"). This is equivalent to compiling the first
function with -msse4.1 and -march=core2 options, and the second function
with -msse4a and -march=amdfam10 options. It is up to you to make sure
that a function is only invoked on a machine that supports the
particular ISA it is compiled for (for example by using cpuid on x86 to
determine what feature bits and architecture family are used).

int core2_func (void) __attribute__ ((__target__ ("arch=core2")));
int sse3_func (void) __attribute__ ((__target__ ("sse3")));

You can either use multiple strings separated by commas to specify
multiple options, or separate the options with a comma (‘,’) within a
single string.
====
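
As a rough sketch of how the target attribute and a cpuid check fit together, the C example below compiles one function with SSE4.1 enabled and only calls it after CPUID reports the feature. It assumes GCC or Clang on x86 (for <cpuid.h> and __get_cpuid). The functions cpu_has_sse41, sum_generic, sum_sse41 and sum are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* CPUID leaf 1 reports SSE4.1 support in ECX bit 19. */
static int cpu_has_sse41(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 19) & 1u);
}

/* Compiled as if -msse4.1 had been given, for this function only. */
__attribute__((target("sse4.1")))
static long sum_sse41(const int *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}
#else
/* Non-x86 build: the SSE4.1 variant does not exist. */
static int cpu_has_sse41(void)
{
    return 0;
}
#endif

/* Built with the default command-line flags, runs everywhere. */
static long sum_generic(const int *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* The caller is responsible for the check, as the GCC text above says. */
long sum(const int *v, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (cpu_has_sse41())
        return sum_sse41(v, n);
#endif
    return sum_generic(v, n);
}
```

For AVX and later extensions a CPUID bit alone is not sufficient, because the operating system must also save the wider registers. That is why the configure output earlier in this ticket checks xgetbv separately.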

It seems it is even possible to fully automate the dispatch to
different implementations, see
https://gcc.gnu.org/wiki/FunctionMultiVersioning
but apparently only for gcc 4.8 or later, whereas Debian stable ships only
4.7, so it is not a feature we should use in our code at this point yet.

Regards,
Harald

Actions #21

Updated by dexter about 7 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 70 to 100

I have now separated the SSE3 and SSE4.1 code into separate files. I used the automake code snippet above, which helped a lot. The build system now detects whether the compiler is able to handle the SSE extensions; if not, the SSE code is not fed into the compiler at all. Also, if --with-sse is missing, no SSE code is compiled.

So, we have now reached full code separation. There should no longer be a risk that the CPU gets wrong opcodes because the ISA does not match up.

See also: https://gerrit.osmocom.org/2134.

Actions #22

Updated by laforge about 7 years ago

On Mon, Mar 20, 2017 at 01:04:33PM +0000, dexter [REDMINE] wrote:

I have now separated the SSE3 and SSE4.1 code into files. I have used
the automake code snippet above, which helped a lot. The build system
now detects if the compiler is able to handle the SSE extensions; if
not, the SSE code is not fed into the compiler at all.

great.

Also if --with-sse is missing, no SSE code is compiled.

It is debatable whether that's a good default. I think on x86 we should
generally build all the optimized versions (as far as compiler support
goes) by default. The run-time selection will make sure to fall back on
the non-optimized C implementation whenever needed.

So I think the best is if the default is like above (only on x86, of
course) for most users. If somebody wants to explicitly disable, he can
still use --without-sse or the like.

So, we now have reached full code separation, there should be no
longer a risk that the CPU gets wrong opcodes because the ISA does not
match up.

great. Please write a mail to the OpenBSC list with Tom Tsou in Cc and
ask him to comment (and if possible, approve) the changes. Thanks!

Regards,
Harald

Actions #23

Updated by dexter about 7 years ago

Mail is out. I have also removed the change that makes --with-sse mandatory. It's now the default again, as it was originally.

Actions #24

Updated by neels almost 7 years ago

  • Status changed from Resolved to In Progress
  • Assignee changed from dexter to neels
  • % Done changed from 100 to 80

Re-opening because I still get an "illegal instruction" error on the gsm tester main unit.

DISCLAIMER: I'm not sure whether this is solved by patches waiting in gerrit, for now just reflecting what I actually see, to report resolution later. Fact is I can't successfully run osmo-trx on the APU main unit yet, hence this ticket shall remain open. Assigning to me so far until things are more clear.

Details:

Binary built with:

+ ./configure --prefix=/home/jenkinsdebian8amd64/jenkins/workspace/osmo-gsm-tester_build_osmo-bts-trx/inst-osmo-bts-trx --without-sse

( http://10.9.1.103/view/osmo-gsm-tester/job/osmo-gsm-tester_build_osmo-bts-trx/1/console )

i.e. compiled without SSE so expecting none of those in the binary.

ran on main unit:

root@apu-roh:/var/tmp/osmo-gsm-tester/trial-18/inst/osmo-bts-trx# LD_LIBRARY_PATH="$PWD/lib" bin/osmo-trx -x
linux; GNU C++ version 4.9.2; Boost_105500; UHD_003.009.005-0-unknown

opening configuration table from path :memory:
Config Settings
   Log Level............... NOTICE
   Device args............. 
   TRX Base Port........... 5700
   TRX Address............. 127.0.0.1
   Channels................ 1
   Tx Samples-per-Symbol... 4
   Rx Samples-per-Symbol... 1
   EDGE support............ Disabled
   Reference............... External
   C0 Filler Table......... Disabled
   Multi-Carrier........... Disabled
   Diversity............... Disabled
   Tuning offset........... 0
   RSSI to dBm offset...... 0
   Swap channels........... 0

-- Loading firmware image: /usr/share/uhd/images/usrp_b200_fw.hex...
-- Detected Device: B200
-- Loading FPGA image: /usr/share/uhd/images/usrp_b200_fpga.bin... done
-- Operating over USB 2.
-- Detecting internal GPSDO.... No GPSDO found
-- Initialize CODEC control...
-- Initialize Radio control...
-- Performing register loopback test... pass
-- Performing CODEC loopback test... pass
-- Asking for clock rate 16.000000 MHz... 
-- Actually got clock rate 16.000000 MHz.
-- Performing timer loopback test... pass
-- Setting master clock rate selection to 'automatic'.
-- Asking for clock rate 26.000000 MHz... 
-- Actually got clock rate 26.000000 MHz.
-- Performing timer loopback test... pass
-- Setting B200 4/1 Tx/Rx SPS
Illegal instruction (core dumped)

backtrace:

Program received signal SIGILL, Illegal instruction.
0x0000000000440ccc in mac_real_vec_n (offset=0, step=1, len=4, y=0x6d6790, h=0x69fcf0, x=0x11)
    at ../common/convolve_base.c:46
46    ../common/convolve_base.c: No such file or directory.
(gdb) bt
#0  0x0000000000440ccc in mac_real_vec_n (offset=0, step=1, len=4, y=0x6d6790, h=0x69fcf0, x=0x11)
    at ../common/convolve_base.c:46
#1  _base_convolve_real (x=x@entry=0x6d96c0, x_len=x_len@entry=41, h=h@entry=0x69fcf0, 
    h_len=h_len@entry=4, y=y@entry=0x6d6790, y_len=y_len@entry=41, start=0, len=41, step=1, offset=0)
    at ../common/convolve_base.c:66
#2  0x000000000044115e in convolve_real (x=0x6d96c0, x_len=41, h=0x69fcf0, h_len=4, y=0x6d6790, 
    y_len=41, start=0, len=41, step=1, offset=0) at convolve.c:570
#3  0x0000000000438de0 in convolve (x=x@entry=0x6d03f0, h=h@entry=0x6a6770, y=0x6d6f40, y@entry=0x0, 
    spanType=spanType@entry=START_ONLY, start=start@entry=0, len=41, len@entry=0, step=1, offset=0)
    at sigProcLib.cpp:476
#4  0x0000000000439f59 in modulateBurstBasic (sps=1, guard_len=<optimized out>, bits=...)
    at sigProcLib.cpp:1166
#5  modulateBurst (wBurst=..., guardPeriodLength=<optimized out>, sps=1, emptyPulse=<optimized out>)
    at sigProcLib.cpp:1182
#6  0x000000000043dacb in generateRACHSequence (sps=<optimized out>) at sigProcLib.cpp:1615
#7  sigProcLibSetup () at sigProcLib.cpp:2159
#8  0x0000000000417a94 in Transceiver::init (this=this@entry=0x6f51f0, filler=1, rtsc=0, rach_delay=0, 
    edge=<optimized out>) at Transceiver.cpp:172
#9  0x000000000040a95b in makeTransceiver (config=config@entry=0x7fffffffe820, radio=radio@entry=
    0x6d9470) at osmo-trx.cpp:287
#10 0x0000000000406e11 in main (argc=<optimized out>, argv=<optimized out>) at osmo-trx.cpp:540

according to my git master that would be...

/* Base vector complex-complex multiply and accumulate */
static void mac_real_vec_n(const float *x, const float *h, float *y,
                           int len, int step, int offset)
{
        for (int i = offset; i < len; i += step)    <----- here
                mac_real(&x[2 * i], &h[2 * i], y);
}

Actions #25

Updated by neels almost 7 years ago

attaching core with binaries for later reference...

Actions #26

Updated by neels almost 7 years ago

dmesg said:

traps: osmo-trx[673] trap invalid opcode ip:440ccc sp:7fff83ed5510 error:0 in osmo-trx[400000+68000]

Actions #27

Updated by neels almost 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

Today's test shows that osmo-trx as built from our jenkins runs on the gsm-tester main unit.
This is a --without-sse build. There might be room for improvement, but I can run it now.

Actions #28

Updated by laforge almost 7 years ago

  • Status changed from Resolved to Closed