https://osmocom.org/https://osmocom.org/favicon.ico?16647414092020-03-16T20:34:05ZOpen Source Mobile Communicationsosmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177432020-03-16T20:34:05Zlaforge
<ul></ul><p>ok, this seems to be related to thread-local-storage. I remember we had hit a gcc bug about this some time ago (<a class="issue tracker-1 status-3 priority-2 priority-default closed" title="Bug: vty tests fails on arm (raspberry pi) (Resolved)" href="https://osmocom.org/issues/4062">#4062</a>)<br /><pre>
(gdb) frame 10
#10 0xb6ef0f30 in osmo_usb_added_cb (fd=10, events=<optimized out>, user_data=0x0) at osmo_libusb.c:86
86 struct osmo_fd *ofd = talloc_zero(OTC_GLOBAL, struct osmo_fd);
(gdb) p OTC_GLOBAL
Cannot find thread-local storage for Thread 1740.1740, shared library /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libosmocore.so.12:
Remote target failed to process qGetTLSAddr request
</pre></p>
<p>Given that this build was created using gcc-7.3.0, we would actually expect it to have fixed thread-local-storage in shared libraries. weird...</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177452020-03-16T20:34:12Zlaforge
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-2 priority-default closed" href="/issues/4062">Bug #4062</a>: vty tests fails on arm (raspberry pi)</i> added</li></ul> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177462020-03-16T20:41:50Zlaforge
<ul></ul><p>Building libosmocore with the <code>-mtls-dialect=gnu2</code> workaround should solve the problem.</p>
<p>....but:<br /><pre>
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
</pre></p>
<p>It seems that the dynamic linker used by musl doesn't support this at all. I'm a bit at a loss, about (I think) 20 years after introduction of thread-local storage. <strong>sigh</strong></p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177472020-03-16T22:22:36Zdalias
<ul></ul><p>The TLS thing looks like it might be a red herring. GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.</p>
<p>With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version. There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.</p>
<p>If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177482020-03-16T22:37:43Zdalias
<ul></ul><p>Actually I forget whether libthreaddb.so is loaded into the tracee/target or on the host running gdb, but in either case we don't have one for musl yet, and if we implement one it will just use generic public interfaces like __tls_get_addr that gdb would ideally already be using itself. Getting gdb to do decent multithreaded debugging without it is a bit of a battle but it works with moderate functionality.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177492020-03-16T23:18:43Zlynxis
<ul></ul><p><a class="user active" href="https://osmocom.org/users/7">laforge</a> which package feed did you use? I would like to reproduce it. On which platform did you try it?</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177502020-03-17T08:43:55Zlaforge
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>20</i></li></ul><p>dalias wrote:</p>
<blockquote>
<p>The TLS thing looks like it might be a red herring.</p>
</blockquote>
<p>it might, but given that we have had the exact same problem at the exact same part of the code on other ARM32 platforms before (See <a class="issue tracker-1 status-3 priority-2 priority-default closed" title="Bug: vty tests fails on arm (raspberry pi) (Resolved)" href="https://osmocom.org/issues/4062">#4062</a>), it would be quite a coincidence. And I'm referring to the actual bug causing the SIGABRT while accessing __thread variables in shared libraries, not to the problems gdb has resolving their address.</p>
<blockquote>
<p>GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.</p>
</blockquote>
<p>Yes, I'm aware of the fact that gdb not resolving the thread-local address is a separate issue (and not the source of my complaint)</p>
<blockquote>
<p>With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version.</p>
</blockquote>
<p>It is 1.1.20 on this platform (an AM335x phyCORE based custom board).</p>
<blockquote>
<p>There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.</p>
</blockquote>
<p>I'm not at liberty to upgrade the base OS, but I could try it just to see if we can narrow it down.</p>
<blockquote>
<p>If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.</p>
</blockquote>
<p>Thanks a lot for your quick feedback here. I'll let you know if we have to resort to that.</p>
<p>The fun part is that the target program is not multi-threaded at all. It's just that our core library (libosmocore) is prepared for multi-threaded programs by declaring various global state (like the talloc contexts; talloc must have one of those per thread) as thread-local.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=177512020-03-17T08:47:25Zlaforge
<ul></ul><p>lynxis wrote:</p>
<blockquote>
<p><a class="user active" href="https://osmocom.org/users/7">laforge</a> which package feed did you use? I would like to reproduce it. On which platform did you try it?</p>
</blockquote>
<p>this is a phycore-am335x based board (you probably know which one). No official package feeds are used, it's a fully custom build.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184002020-05-21T19:03:41Zlaforge
<ul><li><strong>% Done</strong> changed from <i>20</i> to <i>40</i></li></ul><p>After upgrading musl to 1.1.23, I am still seeing a SIGABRT problem in this code path.</p>
<p>It turns out that modern versions of libtalloc randomize the magic "key" by which they identify every chunk of talloc memory. This presumably is a mechanism to make it harder to attack, as every time you start a process, that magic value changes. Initialization of this <code>talloc_magic</code> happens inside <code>talloc_lib_init()</code>, which is an <code>__attribute__((constructor))</code> function.</p>
Now in libosmocore, we use talloc from within other <code>__attribute__((constructor))</code> functions, particularly also <code>osmo_ctx_init()</code> here in this particular codepath. So if the constructors are called in the "wrong" order (libosmocore before talloc), then we get the following sequence of events:
<ol>
<li>libosmocore allocating some memory object in talloc. Talloc uses the compiled-in default TALLOC_MAGIC value for identifying those chunks</li>
<li>talloc constructor changing the magic value (in a global variable) to some randomized value</li>
<li>any later follow-up code that wants to allocate memory (using a context/chunk with the old magic value) will <code>abort()</code> as the magic value is not what it expects</li>
</ol>
<p>This problem is aggravated by the fact that talloc_lib_init() will unconditionally use a new random value if called repeatedly, rather than checking if the current value is already a random value. In other words: it's not safe to simply call talloc_lib_init() ahead of any memory allocation function in a constructor.</p>
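<p>The sequence above can be replayed in a toy model. This is only a sketch: the names <code>toy_magic</code>, <code>toy_alloc()</code>, <code>toy_lib_init()</code> and <code>toy_replay()</code> as well as both magic values are made up for illustration and are not the real talloc internals. Instead of calling <code>abort()</code>, the model just reports whether the early chunk would still be accepted.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, chosen only for this illustration. */
#define TOY_MAGIC_DEFAULT 0x11111111u
#define TOY_MAGIC_RANDOM  0x22222222u

/* Global "key", starts out as the compiled-in default. */
static uint32_t toy_magic = TOY_MAGIC_DEFAULT;

struct toy_chunk {
    uint32_t magic; /* key stamped into the chunk at allocation time */
};

/* Stamp a chunk with whatever the key is right now. */
static void toy_alloc(struct toy_chunk *c)
{
    c->magic = toy_magic;
}

/* A chunk is accepted only if its stamp matches the current key.
 * The real library would abort() on a mismatch; here we report it. */
static bool toy_chunk_valid(const struct toy_chunk *c)
{
    return c->magic == toy_magic;
}

/* Models the library constructor: unconditionally replaces the key,
 * without checking whether it was already randomized. */
static void toy_lib_init(uint32_t random_value)
{
    assert(random_value != TOY_MAGIC_DEFAULT);
    toy_magic = random_value;
}

/* Replays the three steps for either constructor order and returns
 * true if the chunk allocated by the "user" constructor is still
 * accepted afterwards. */
static bool toy_replay(bool lib_init_runs_first)
{
    struct toy_chunk early;

    toy_magic = TOY_MAGIC_DEFAULT;
    if (lib_init_runs_first)
        toy_lib_init(TOY_MAGIC_RANDOM);
    toy_alloc(&early);                  /* step 1: early allocation */
    if (!lib_init_runs_first)
        toy_lib_init(TOY_MAGIC_RANDOM); /* step 2: key changes afterwards */
    return toy_chunk_valid(&early);     /* step 3: later use checks the key */
}
```

<p>In this model <code>toy_replay(true)</code> keeps the chunk valid, while <code>toy_replay(false)</code> leaves it with a stale key. Calling <code>toy_lib_init()</code> a second time with a different value would invalidate every chunk allocated in between, which is why calling the initializer by hand is not a safe fix.</p>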
<p>Interestingly, gcc supports a <em>priority</em> value associated with constructor functions, where a low value means high priority and a high value means low priority: <a class="external" href="https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attributes.html">https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attributes.html</a></p>
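<p>A minimal sketch of how the priority value behaves inside a single program, assuming gcc or clang on an ELF target. The function names and the recording buffer are hypothetical. As far as I understand, the priority only orders constructors within one linked object (an executable or one shared library); the order between different shared libraries is up to the dynamic linker.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Records the order in which the constructors ran. Being static, the
 * buffer is zero-initialized and therefore always NUL-terminated. */
static char ctor_order[8];
static size_t ctor_count;

static void record(char tag)
{
    assert(ctor_count < sizeof(ctor_order) - 1);
    ctor_order[ctor_count++] = tag;
}

/* Defined first, but the larger number means lower priority. */
__attribute__((constructor(102)))
static void late_init(void)
{
    record('B');
}

/* Smaller number means higher priority. Values up to 100 are reserved
 * for the implementation, so user code starts at 101. */
__attribute__((constructor(101)))
static void early_init(void)
{
    record('A');
}

/* Returns non-zero if the constructors ran in the expected order. */
static int ctor_order_is(const char *expected)
{
    return strcmp(ctor_order, expected) == 0;
}
```

<p>Called from <code>main()</code>, <code>ctor_order_is("AB")</code> should hold even though <code>late_init()</code> appears first in the source.</p>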
<p>So I will try to use a large numeric priority value in libosmocore and see if that addresses the problem.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184012020-05-21T20:03:09Zlaforge
<ul></ul><p>I tried it with <code>__attribute__((constructor(65535)))</code> in libosmocore. However, the bug still persists.</p>
<p>To my big surprise, I couldn't find any notion of priority handling in <a class="external" href="https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c">https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c</a> - could it be they simply ignore the gcc specification of this attribute?</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184022020-05-21T20:33:28Zlaforge
<ul></ul><p>I submitted my findings to <a class="external" href="https://www.openwall.com/lists/musl/2020/05/21/7">https://www.openwall.com/lists/musl/2020/05/21/7</a></p>
<p>Meanwhile, I hacked an explicit call to osmo_ctx_init() into the main function of osmo-remsim-client, making libosmocore re-allocate the contexts at that point. This gets me beyond the talloc-abort, but it's of course not a solution - particularly since musl is hostile towards people trying to implement quirks by not offering a <code>__MUSL__</code> macro or the like.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184032020-05-21T20:39:03Zlaforge
<ul></ul><p>Also see <a class="external" href="https://github.com/bminor/glibc/blob/e3022f4bcd69eb9f103a6de626a1e9e343fc7ada/elf/dl-init.c#L109">https://github.com/bminor/glibc/blob/e3022f4bcd69eb9f103a6de626a1e9e343fc7ada/elf/dl-init.c#L109</a></p>
<pre>
/* Stupid users forced the ELF specification to be changed. It now
says that the dynamic loader is responsible for determining the
order in which the constructors have to run. The constructors
for all dependencies of an object must run before the constructor
for the object itself. Circular dependencies are left unspecified.
This is highly questionable since it puts the burden on the dynamic
loader which has to find the dependencies at runtime instead of
letting the user do it right. Stupidity rules! */
</pre>
<p>clearly the author of those lines is very unhappy, but the spec is the spec...</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184042020-05-21T20:49:25Zlaforge
<ul></ul><p>Ok, already in 1997 the specification (<a href="http://www.sco.com/developers/devspecs/gabi41.pdf" class="external">System V Application Binary Interface, Edition 4.1</a>) on this was clear:</p>
<pre>
Initialization and Termination Functions
After the dynamic linker has built the process image and performed the
relocations, each shared object gets the opportunity to execute some
initialization code. All shared object initializations happen before the
executable file gains control. Before the initialization code for any
object A is called, the initialization code for any other objects that
object A depends on are called. For these purposes, an object A depends on
another object B, if B appears in A’s list of needed objects (recorded in
the DT_NEEDED entries of the dynamic structure). The order of
initialization for circular dependencies is undefined.
</pre>
So the constructor of libtalloc must be called before the constructor of libosmocore, as long as
<ul>
<li>there is no circular dependency (there certainly is not)</li>
<li>libtalloc appears in libosmocore's list of needed objects (DT_NEEDED), which is true.</li>
</ul>
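<p>One way to picture the rule from the specification is a post-order walk over the DT_NEEDED graph. The sketch below is not taken from any real dynamic linker: <code>struct obj</code>, <code>init_order()</code>, <code>init_index_of()</code> and the executable name <code>client</code> are hypothetical, and the graph is a simplified version of the dependencies in this issue.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 4

/* Hypothetical model of a loaded object and its DT_NEEDED entries. */
struct obj {
    const char *name;
    struct obj *needed[MAX_DEPS]; /* NULL-terminated list */
    bool visited;
};

/* Post-order depth-first walk: every needed object is appended to
 * out[] before the object that needs it. Returns the new count. */
static size_t init_order(struct obj *o, struct obj **out, size_t n)
{
    if (o->visited)
        return n;
    o->visited = true; /* also stops endless recursion on cycles */
    for (size_t i = 0; i < MAX_DEPS && o->needed[i]; i++)
        n = init_order(o->needed[i], out, n);
    out[n++] = o;
    return n;
}

/* Builds a simplified graph (libosmocore needs libtalloc, a made-up
 * executable needs libosmocore) and returns the position at which
 * `name` would be initialized, or -1 if it is not in the graph. */
static int init_index_of(const char *name)
{
    struct obj libc        = { "libc.so",        { NULL }, false };
    struct obj libtalloc   = { "libtalloc.so.2", { &libc }, false };
    struct obj libosmocore = { "libosmocore.so", { &libtalloc, &libc }, false };
    struct obj exe         = { "client",         { &libosmocore, &libc }, false };
    struct obj *order[4];
    size_t n = init_order(&exe, order, 0);

    assert(n == 4); /* every object is reachable from the executable */
    for (size_t i = 0; i < n; i++)
        if (strcmp(order[i]->name, name) == 0)
            return (int)i;
    return -1;
}
```

<p>With this walk, <code>init_index_of("libtalloc.so.2")</code> is smaller than <code>init_index_of("libosmocore.so")</code>, i.e. libtalloc is initialized first, and the executable comes last.</p>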
<p>So even without any priority, the constructors must be called in the intuitively correct order. So to me this still is a linker bug.</p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=184052020-05-21T20:55:48Zlaforge
<ul></ul><p>the dependency information seems quite [obviously] right: libtalloc doesn't depend on libosmocore, but libosmocore depends on libtalloc.<br /><pre>
# ldd ./libosmocore.so
ldd (0xb6f46000)
libtalloc.so.2 => /usr/lib/libtalloc.so.2 (0xb6efd000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6ee2000)
libc.so => ldd (0xb6f46000)
# ldd ./libtalloc.so
ldd (0xb6f5b000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6f29000)
libc.so => ldd (0xb6f5b000)
</pre></p> osmo-remsim - Bug #4456: SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with muslhttps://osmocom.org/issues/4456?journal_id=187012020-06-13T18:01:00Zlaforge
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>40</i> to <i>100</i></li></ul><p>works with musl >= 1.1.23</p>