Bug #3375
closedsimtrace2.git master breaks cardem on qmod
100%
Description
After several hours of debugging and finally resorting to a git bisect, I have determined that the following two changes in master render it broken/unusable for qmod/cardem:
commit d814cd480c63fa9d1e18adeb4a167ad1e3bf5494 (HEAD) Author: Harald Welte <laforge@gnumonks.org> Date: Wed Jul 4 04:02:31 2018 +0200 Revert "ring buffer: increase buffer size from 128 to 256 to cope with large debug output" This reverts commit a9bca48914abf4a16fccfbf223810c0918c2a200. commit 1412a8c9972ab9e695802f6f4ccc134c7b7e2e21 Author: Harald Welte <laforge@gnumonks.org> Date: Wed Jul 4 04:00:39 2018 +0200 Revert "console: use buffer and interrupts instead of busy loops for UART debug output" This reverts commit eac1bec4285a48b466e9fd09b0513b3c49d393b3.
If I revert those two commits, simtrace2-remsim again works as expected.
Updated by laforge over 5 years ago
The revert of the "use buffer and interupts" commit can be avoided by applying the following patch:
--- a/firmware/libcommon/source/ringbuffer.c +++ b/firmware/libcommon/source/ringbuffer.c @@ -63,7 +63,7 @@ void rbuf_write(volatile ringbuf * rb, uint8_t item) local_irq_restore(state); } else { local_irq_restore(state); - TRACE_ERROR("Ringbuffer full, losing bytes!"); + //TRACE_ERROR("Ringbuffer full, losing bytes!"); } }
The point here is that we cannot use TRACE_ERROR() from ringbuffer.c, if we are using ringbuffer.c from our low-level console printing code
Updated by laforge over 5 years ago
https://gerrit.osmocom.org/#/c/simtrace2/+/9859 contains a cleaned up version of the ringbuffer.c patch.
Please note it's still required to revert the 256 byte size change in order to make cardemu work again. I have no idea why.
Updated by tsaitgaist over 5 years ago
- % Done changed from 0 to 10
I can reproduce the first error now, or at least a related one.
With a ring buffer size of 512 the watchdog triggers shortly after booting the main application.
This does not occur with a ring buffer size of 128.
I did not try to reproduce the simtrace2-remsim yet.
I will solve the buffer issue first (maybe they are related).
Updated by tsaitgaist over 5 years ago
- Status changed from New to Resolved
- % Done changed from 10 to 100
I've found the cause and resolved it.
See https://gerrit.osmocom.org/#/c/simtrace2/+/9908/ Change-Id: Ie354d603fe302d3d5bdfa9c31575411de722323b for the fix.
WARNING: to fix the issue, DFU has to be reflashed on all qmod boards.
The explanation will be long but might be useful to identify and debug similar issues in the future.
To debug it I used an SWD adapter and GDB:
arm-none-eabi-gdb --eval-command='target remote | openocd --file interface/stlink-v2.cfg --command "set CPUTAPID 0x2ba01477" --file target/at91sam3sXX.cfg --command "gdb_port pipe" --command "init" --command "halt"' -eval-command='monitor halt' ./bin/qmod-cardem-dfu.elf
Part 1 - UART
-------------
The first bug we see (even before testing against simtrace2-remsim) is that the console output over UART is broken.
Since UART now uses the ring buffer and removing that commit removed the bug, it is a good idea to start looking at this.
The output of the console is sometimes blocked, and sometimes it displays garbage information (depends on the code, and has some random factor). This indicates that there might be a memory corruption.
Serial console output example:
=============}KRK¡ÿÿ
lets track libcommon/source/ringbuffer.c:rbuf_write, particularly rb->iwr which is where the data will be written.
(gdb) break rbuf_write Breakpoint 1 at 0x4068ee: file libcommon/source/ringbuffer.c, line 73. (gdb) commands Type commands for breakpoint(s) 1, one per line. End with a line saying just "end". >p uart_tx_buffer.iwr >continue >end (gdb) c ... Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=110 'n') at libcommon/source/ringbuffer.c:73 73 if (NULL == rb) { $148 = 147 Info : halted: PC: 0x004068f0 ---Type <return> to continue, or q <return> to quit--- Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=100 'd') at libcommon/source/ringbuffer.c:73 73 if (NULL == rb) { $149 = 148 Info : halted: PC: 0x004068f0 Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=13 '\r') at libcommon/source/ringbuffer.c:73 73 if (NULL == rb) { $150 = 149 Info : halted: PC: 0x004068f0 Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=45 '-') at libcommon/source/ringbuffer.c:73 73 if (NULL == rb) { $151 = 4294967293 Info : halted: PC: 0x004068f0 Breakpoint 1, rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=82 'R') at libcommon/source/ringbuffer.c:63 63 local_irq_save(state); $286 = 0 Info : halted: PC: 0x004068c4
you see the uart_tx_buffer.iwr always getting incremented by 1 (because of the debug output), and at some point the value is very high (before the board restarts).
The code only allows the index to increment by 1 though:
rb->iwr = (rb->iwr + 1) % RING_BUFLEN;
Thus someone else seems to be changing this value.
Part 2 - PutHexa
----------------
In the source code nothing else is writing this variable, so lets check during runtime which part does it.
(gdb) watch uart_tx_buffer.iwr Hardware watchpoint 1: uart_tx_buffer.iwr (gdb) define tx Type commands for definition of "tx". End with a line saying just "end". >p uart_tx_buffer.iwr >continue >end (gdb) c Continuing. ... Program received signal SIGTRAP, Trace/breakpoint trap. rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=10 '\n') at libcommon/source/ringbuffer.c:67 67 local_irq_restore(state); (gdb) tx $231 = 229 Program received signal SIGTRAP, Trace/breakpoint trap. rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=13 '\r') at libcommon/source/ringbuffer.c:67 67 local_irq_restore(state); (gdb) tx $232 = 230 Program received signal SIGTRAP, Trace/breakpoint trap. 0x0040658c in PutHexa ( pStr=pStr@entry=0x2000289b <card_handles+143> '=' <repeats 66 times>, "\n\rSLMtrace2 firmware 0.4.92-bf6b-dirty (C) 2010-2016 by Harald Welte\n\r", '=' <repeats 64 times>..., fill=48 '0', width=width@entry=5, maj=maj@entry=0 '\000', value=value@entry=166144) at libcommon/source/stdio.c:242 242 width--; (gdb) $233 = 536881307
libcommon/source/stdio.c:PutHexa is writing the index, but the source code doesn't write anywhere uart_tx_buffer.iwr, and the line in the debugger only shows the entry of the function as writing the value.
Part 3 - stack overflow
-----------------------
The answer to this puzzle is in the registers:
(gdb) info reg r0 0x2000289b 536881307 r1 0x30 48 r2 0x5 5 r3 0x0 0 r4 0x2000289b 536881307 r5 0x289009 2658313 r6 0x0 0 r7 0x5 5 r8 0x40a4e5 4236517 r9 0x40a391 4236177 r10 0x40a3ad 4236205 r11 0x10000000 268435456 r12 0x30 48 sp 0x200027c4 0x200027c4 <uart_tx_buffer+504> lr 0x4065a9 4220329 pc 0x40658c 0x40658c <PutHexa+2> xPSR 0x21000000 553648128 msp 0x200027c4 0x200027c4 <uart_tx_buffer+504> psp 0x40 0x40 primask 0x0 0 basepri 0x0 0 faultmask 0x0 0 control 0x0 0
You see the stack pointer pointing to 0x200027c4:
sp 0x200027c4 0x200027c4 <uart_tx_buffer+504>
This is recognized by GDB as being uart_tx_buffer+504, and since the buffer is 512 bytes, the function initialization will write on the stack, thus also overwrite the .iwr member.
uart_tx_buffer is in the bss section since it is statically allocated.
Thus the stack grew so larger that it overflows onto the bss section on corrupts it.
In this example it was PutChar, but that could have been any function. It just depend on the call order and function sizes.
Part 4 - stack size
-------------------
The SAM3S4 has 48 kiB of RAM. How come the size grow so large that is reached the BSS section.
Let's have a look at the stack initialization, just at the start of the reset vector:
(gdb) load Info : sam3 auto-erases while programming (request ignored) Loading section .text, size 0x6d28 lma 0x404000 Loading section .relocate, size 0x65c lma 0x40ad28 Warn : keep_alive() was not invoked in the 1000ms timelimit. GDB alive packet not sent! (1255). Workaround: increase "set remotetimeout" in GDB target halted due to debug-request, current mode: Thread xPSR: 0x01000000 pc: 0x004022a4 msp: 0x20002b04 Start address 0x406c68, load size 29572 Transfer rate: 22 KB/sec, 9857 bytes/write. (gdb) info reg ... sp 0x200027c4 0x200027c4 <uart_tx_buffer+504> lr 0x4065a9 4220329 pc 0x406c68 0x406c68 <ResetException> xPSR 0x21000000 553648128 msp 0x20002b04 0x20002b04 <pdwStack+524>
The initial stack pointer (msp) is actually not pointing at the end of the RAM (e.g. 0x200 c000), but at 0x20002b04 (~ 11 kB after the start of the RAM).
The code did not have to grow the stack ~ 48 kiB to overflow onto the bss, but only ~ 10 kiB, which is possible for main applications but still strange.
Looking at the stack properties:
$ objdump -h bin/qmod-dfu-flash.elf bin/qmod-dfu-flash.elf: file format elf32-little Sections: Idx Name Size VMA LMA File off Algn 0 .text 00003cb8 00400000 00400000 00010000 2**3 CONTENTS, ALLOC, LOAD, READONLY, CODE 1 .relocate 00000098 20000000 00403cb8 00020000 2**2 CONTENTS, ALLOC, LOAD, CODE 2 .bss 00000704 20000098 00403d50 00020098 2**2 ALLOC 3 .stack 00002400 20000708 004043c0 00020098 2**3 ALLOC
The stack has a fixed size of 9 kiB (0x00002400), and with the .bss (0x00000704) we arrive at 0x20002b04, the initial stack pointer.
The stack sized is fixed in libboard/common/source/board_cstartup_gnu.c:
/* Stack Configuration */ #define STACK_SIZE 0x900 /** Stack size (in DWords) */ __attribute__ ((aligned(8),section(".stack"))) uint32_t pdwStack[STACK_SIZE] ;
and the (Atmel)linker script puts it just after the bss (instead of as usual putting it at the end of the RAM). ./libboard/common/resources/sam3s4/flash.ld :
/* .bss section which is used for uninitialized data */ .bss (NOLOAD) : { . = ALIGN(4); _sbss = . ; _szero = .; *(.bss .bss.*) *(COMMON) . = ALIGN(4); _ebss = . ; _ezero = .; } > ram /* stack section */ .stack (NOLOAD): { . = ALIGN(8); *(.stack .stack.*) } > ram . = ALIGN(4); _end = . ;
Part 5 - stack initialization
-----------------------------
If you paid attention, you might have noticed that msp = 0x20002b04 was at the start of the main application (qmod-cardem-dfu.elf), but the sections I showed (using objdump) were for the bootloader (bin/qmod-dfu-flash.elf).
If you look at the sections of the main application:
$ objdump -h bin/qmod-cardem-dfu.elf bin/qmod-cardem-dfu.elf: file format elf32-little Sections: Idx Name Size VMA LMA File off Algn 0 .text 00006d20 00404000 00404000 00004000 2**3 CONTENTS, ALLOC, LOAD, READONLY, CODE 1 .relocate 0000065c 20000020 0040ad20 00010020 2**2 CONTENTS, ALLOC, LOAD, CODE 2 .bss 00002278 2000067c 0040b37c 0001067c 2**2 ALLOC 3 .stack 00002400 200028f8 0040d5f8 0001067c 2**3 ALLOC
The stack has has the same size, but .bss is must larger (00002278).
Thus the stack only needed to grow ~ 2 kiB (0x0002b04 - 0x00002278) reach bss, which is very likely (and happens every time even during early boot).
But why did the main application used the initial stack pointer of the bootloader?
Well the stack pointer is the first work of the binary and loaded at boot:
xxd bin/qmod-dfu-flash.bin | head 00000000: 042b 0020 a522 4000 7d11 4000 0913 4000 .+. ."@.}.@...@. 00000010: 1d13 4000 2d13 4000 3d13 4000 0000 0000 ..@.-.@.=.@..... 00000020: 0000 0000 0000 0000 0000 0000 4d13 4000 ............M.@.
The second word is a pointer to the vector table (starting with the reset vector).
When DFU start the main application, it only starts the reset vector.
libboard/common/source/board_cstartup_gnu.c :
static void BootIntoApp(void) { unsigned int *pSrc; void (*appReset)(void); pSrc = (unsigned int *) ((unsigned char *)IFLASH_ADDR + BOARD_DFU_BOOT_SIZE); SCB->VTOR = ((unsigned int)(pSrc)) | (0x0 << 7); appReset = (void(*)(void))pSrc[1]; g_dfu->state = DFU_STATE_appIDLE; appReset(); }
But this does not (re)set the stack pointer. Thus the old stack pointer (from DFU, with much less bss) was kept.
The main application stack pointer is also located at the beginning of the image:
$ xxd bin/qmod-cardem-dfu.bin | head 00000000: f44c 0020 696c 4000 6158 4000 ed59 4000 .L. il@.aX@..Y@. 00000010: 015a 4000 115a 4000 215a 4000 0000 0000 .Z@..Z@.!Z@..... 00000020: 0000 0000 0000 0000 0000 0000 315a 4000 ............1Z@.
So before starting the main application, we need to ensure that the MSP has this initial value:
__asm__ volatile ("MSR msp,%0" : :"r"(*pSrc));
Part 6 - remsim
---------------
With that fixed, the main image was booting and running fine.
============================================================================= SIMtrace2 firmware 0.4.93-a952 (C) 2010-2016 by Harald Welte ============================================================================= -I- Chip ID: 0x28900960 (Ext 0x00000000) -I- Serial Nr. 51203120-38463850-33303231-36313035 -I- Reset Cause: 0x0 -I- Detected Quad-Modem ST12 -I- VERSION_DET ADC=3727 => 3000235 uVund -I- USB init... USBD_Init Std gDesc Dev Std sAddr SetAddr(15) Std gDesc Dev Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc Qua -W- Sta 0x8028 [0] -W- _ Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc Cfg Std gDesc Cfg Std gDesc Str0 Std gDesc Str2 Std gDesc Str1 Std sCfg SetCfg(1) cfgChanged1 Std gDesc Str5 Std gDesc Str7 Std gDesc Str8 -I- calling configure of all configurations... -I- calling init of config 1... -I- 536881924: CLK activateda work-around -I- 536881924: CLK activated -I- Modem 0: physical SIM -I- 0: Use local/physical SIM -I- Modem 1: physical SIM -I- 1: Use local/physical SIM -I- entering main loop... -I- 0: Card S-1t-i Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std gDesc Str238 -I- 1: VCC activated -I- 1: CLK activated -I- 1: CLK deactivated -I- 1: VCC deactivated -I- 1: VCC activated ... -I- 0: VCC deactivated
and running simtrace2-remsim did not produce any error:
$ sudo LD_LIBRARY_PATH=/usr/local/lib ./simtrace2-remsim --usb-vendor 1d50 --usb-product 4004 --usb-address 10 --usb-interface 0 --usb-config 1 simtrace2-remsim - Remote SIM card forwarding (C) 2010-2017 by Harald Welte <laforge@gnumonks.org> SCardEstablishContext: OK SCardListReaders: OK SCardConnect: OK <- 01 05 00 00 00 00 09 00 01 <- 02 02 00 00 00 00 09 00 01 <= cardem_request_set_atr(3b 9f 96 80 1f c7 80 31 a0 73 be 21 13 67 43 20 07 18 00 00 01 a5 ) <- 01 02 00 00 00 00 1f 00 16 3b 9f 00 80 1f c7 80 31 a0 73 be 21 13 67 43 20 07 18 00 00 01 a5 <- 02 01 00 00 00 00 0b 00 02 2c 01 Entering main loop ^C<- 01 05 00 00 00 00 09 00 00
serial output:
Std gDesc Str0 Std gDesc Str1 Std gDesc Str0 Std gDesc Str2 Std gDesc Str0 Std gDesc Str5 Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc -W- USBDDriver_GetDescriptor: Unknown descriptor type (10) -W- Sta 0x888A8 [0] -W- _ Std gSta Dev Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std sInterface -W- Sta 0x8880C [0] -W- _ Std cFeat Hlt Std cFeat Hlt Std cFeat Hlt -I- 0: skipping unsupported card_insert to INSERTED -I- Modem 0: virtual SIM -I- 0: Use remote/emulated SIM -I- 0: Asserting modem reset -I- 0: De-asserting modem reset
I don't know what the expected output should look like though.
WARNING: to fix the issue, DFU has to be reflashed on all qmod boards.
Updated by laforge over 5 years ago
Amazing, thanks a lot for your diligence in hunting this down, and for your detailed report.
As I'm about to leave for my holidays, I could only quickly sweep it, but for sure I'll re-read
it in all details as soon as I find some time.
Regards,
Harald