Project

General

Profile

Bug #3375

simtrace2.git master breaks cardem on qmod

Added by laforge 12 days ago. Updated 9 days ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
07/04/2018
Due date:
% Done:

100%

Estimated time:

Description

After several hours of debugging and finally resorting to a git bisect, I have determined that the following two changes in master render it broken/unusable for qmod/cardem:

commit d814cd480c63fa9d1e18adeb4a167ad1e3bf5494 (HEAD)
Author: Harald Welte <laforge@gnumonks.org>
Date:   Wed Jul 4 04:02:31 2018 +0200

    Revert "ring buffer: increase buffer size from 128 to 256 to cope with large debug output" 

    This reverts commit a9bca48914abf4a16fccfbf223810c0918c2a200.

commit 1412a8c9972ab9e695802f6f4ccc134c7b7e2e21
Author: Harald Welte <laforge@gnumonks.org>
Date:   Wed Jul 4 04:00:39 2018 +0200

    Revert "console: use buffer and interrupts instead of busy loops for UART debug output" 

    This reverts commit eac1bec4285a48b466e9fd09b0513b3c49d393b3.

If I revert those two commits, simtrace2-remsim again works as expected.

History

#1 Updated by laforge 12 days ago

The revert of the "use buffer and interupts" commit can be avoided by applying the following patch:

--- a/firmware/libcommon/source/ringbuffer.c
+++ b/firmware/libcommon/source/ringbuffer.c
@@ -63,7 +63,7 @@ void rbuf_write(volatile ringbuf * rb, uint8_t item)
                local_irq_restore(state);
        } else {
                local_irq_restore(state);
-               TRACE_ERROR("Ringbuffer full, losing bytes!");
+               //TRACE_ERROR("Ringbuffer full, losing bytes!");
        }
 }

The point here is that we cannot use TRACE_ERROR() from ringbuffer.c, if we are using ringbuffer.c from our low-level console printing code

#2 Updated by laforge 12 days ago

https://gerrit.osmocom.org/#/c/simtrace2/+/9859 contains a cleaned up version of the ringbuffer.c patch.

Please note it's still required to revert the 256 byte size change in order to make cardemu work again. I have no idea why.

#3 Updated by laforge 12 days ago

  • Priority changed from Normal to High

#4 Updated by tsaitgaist 11 days ago

  • % Done changed from 0 to 10

I can reproduce the first error now, or at least a related one.
With a ring buffer size of 512 the watchdog triggers shortly after booting the main application.
This does not occur with a ring buffer size of 128.

I did not try to reproduce the simtrace2-remsim yet.
I will solve the buffer issue first (maybe they are related).

#5 Updated by tsaitgaist 9 days ago

  • Status changed from New to Resolved
  • % Done changed from 10 to 100

I've found the cause and resolved it.
See https://gerrit.osmocom.org/#/c/simtrace2/+/9908/ Change-Id: Ie354d603fe302d3d5bdfa9c31575411de722323b for the fix.

WARNING: to fix the issue, DFU has to be reflashed on all qmod boards.

The explanation will be long but might be useful to identify and debug similar issues in the future.

To debug it I used an SWD adapter and GDB:

arm-none-eabi-gdb --eval-command='target remote | openocd --file interface/stlink-v2.cfg --command "set CPUTAPID 0x2ba01477" --file target/at91sam3sXX.cfg  --command "gdb_port pipe" --command "init" --command "halt"' -eval-command='monitor halt' ./bin/qmod-cardem-dfu.elf

Part 1 - UART
-------------

The first bug we see (even before testing against simtrace2-remsim) is that the console output over UART is broken.
Since UART now uses the ring buffer and removing that commit removed the bug, it is a good idea to start looking at this.

The output of the console is sometimes blocked, and sometimes it displays garbage information (depends on the code, and has some random factor). This indicates that there might be a memory corruption.

Serial console output example:

=============}KRK¡ÿÿ

lets track libcommon/source/ringbuffer.c:rbuf_write, particularly rb->iwr which is where the data will be written.

(gdb) break rbuf_write
Breakpoint 1 at 0x4068ee: file libcommon/source/ringbuffer.c, line 73.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>p uart_tx_buffer.iwr
>continue
>end
(gdb) c

...

Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=110 'n') at libcommon/source/ringbuffer.c:73
73        if (NULL == rb) {
$148 = 147
Info : halted: PC: 0x004068f0

---Type <return> to continue, or q <return> to quit---
Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=100 'd') at libcommon/source/ringbuffer.c:73
73        if (NULL == rb) {
$149 = 148
Info : halted: PC: 0x004068f0

Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=13 '\r') at libcommon/source/ringbuffer.c:73
73        if (NULL == rb) {
$150 = 149
Info : halted: PC: 0x004068f0

Breakpoint 1, rbuf_write (rb=rb@entry=0x200025c4 <uart_tx_buffer>, item=item@entry=45 '-') at libcommon/source/ringbuffer.c:73
73        if (NULL == rb) {
$151 = 4294967293
Info : halted: PC: 0x004068f0

Breakpoint 1, rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=82 'R') at libcommon/source/ringbuffer.c:63
63        local_irq_save(state);
$286 = 0
Info : halted: PC: 0x004068c4

you see the uart_tx_buffer.iwr always getting incremented by 1 (because of the debug output), and at some point the value is very high (before the board restarts).

The code only allows the index to increment by 1 though:

rb->iwr = (rb->iwr + 1) % RING_BUFLEN;

Thus someone else seems to be changing this value.

Part 2 - PutHexa
----------------

In the source code nothing else is writing this variable, so lets check during runtime which part does it.

(gdb) watch uart_tx_buffer.iwr
Hardware watchpoint 1: uart_tx_buffer.iwr
(gdb) define tx
Type commands for definition of "tx".
End with a line saying just "end".
>p uart_tx_buffer.iwr
>continue
>end
(gdb) c
Continuing.

...

Program received signal SIGTRAP, Trace/breakpoint trap.
rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=10 '\n') at libcommon/source/ringbuffer.c:67
67            local_irq_restore(state);
(gdb) tx
$231 = 229

Program received signal SIGTRAP, Trace/breakpoint trap.
rbuf_write (rb=rb@entry=0x200025cc <uart_tx_buffer>, item=item@entry=13 '\r') at libcommon/source/ringbuffer.c:67
67            local_irq_restore(state);
(gdb) tx
$232 = 230

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0040658c in PutHexa (
    pStr=pStr@entry=0x2000289b <card_handles+143> '=' <repeats 66 times>, "\n\rSLMtrace2 firmware 0.4.92-bf6b-dirty (C) 2010-2016 by Harald Welte\n\r", '=' <repeats 64 times>..., fill=48 '0', width=width@entry=5, maj=maj@entry=0 '\000', value=value@entry=166144)
    at libcommon/source/stdio.c:242
242        width--;
(gdb) 
$233 = 536881307

libcommon/source/stdio.c:PutHexa is writing the index, but the source code doesn't write anywhere uart_tx_buffer.iwr, and the line in the debugger only shows the entry of the function as writing the value.

Part 3 - stack overflow
-----------------------

The answer to this puzzle is in the registers:

(gdb) info reg
r0             0x2000289b    536881307
r1             0x30    48
r2             0x5    5
r3             0x0    0
r4             0x2000289b    536881307
r5             0x289009    2658313
r6             0x0    0
r7             0x5    5
r8             0x40a4e5    4236517
r9             0x40a391    4236177
r10            0x40a3ad    4236205
r11            0x10000000    268435456
r12            0x30    48
sp             0x200027c4    0x200027c4 <uart_tx_buffer+504>
lr             0x4065a9    4220329
pc             0x40658c    0x40658c <PutHexa+2>
xPSR           0x21000000    553648128
msp            0x200027c4    0x200027c4 <uart_tx_buffer+504>
psp            0x40    0x40
primask        0x0    0
basepri        0x0    0
faultmask      0x0    0
control        0x0    0

You see the stack pointer pointing to 0x200027c4:

sp             0x200027c4    0x200027c4 <uart_tx_buffer+504>

This is recognized by GDB as being uart_tx_buffer+504, and since the buffer is 512 bytes, the function initialization will write on the stack, thus also overwrite the .iwr member.

uart_tx_buffer is in the bss section since it is statically allocated.
Thus the stack grew so larger that it overflows onto the bss section on corrupts it.
In this example it was PutChar, but that could have been any function. It just depend on the call order and function sizes.

Part 4 - stack size
-------------------

The SAM3S4 has 48 kiB of RAM. How come the size grow so large that is reached the BSS section.

Let's have a look at the stack initialization, just at the start of the reset vector:

(gdb) load
Info : sam3 auto-erases while programming (request ignored)
Loading section .text, size 0x6d28 lma 0x404000
Loading section .relocate, size 0x65c lma 0x40ad28
Warn : keep_alive() was not invoked in the 1000ms timelimit. GDB alive packet not sent! (1255). Workaround: increase "set remotetimeout" in GDB
target halted due to debug-request, current mode: Thread 
xPSR: 0x01000000 pc: 0x004022a4 msp: 0x20002b04
Start address 0x406c68, load size 29572
Transfer rate: 22 KB/sec, 9857 bytes/write.
(gdb) info reg
...
sp             0x200027c4    0x200027c4 <uart_tx_buffer+504>
lr             0x4065a9    4220329
pc             0x406c68    0x406c68 <ResetException>
xPSR           0x21000000    553648128
msp            0x20002b04    0x20002b04 <pdwStack+524>

The initial stack pointer (msp) is actually not pointing at the end of the RAM (e.g. 0x200 c000), but at 0x20002b04 (~ 11 kB after the start of the RAM).
The code did not have to grow the stack ~ 48 kiB to overflow onto the bss, but only ~ 10 kiB, which is possible for main applications but still strange.

Looking at the stack properties:

$ objdump -h bin/qmod-dfu-flash.elf

bin/qmod-dfu-flash.elf:     file format elf32-little

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00003cb8  00400000  00400000  00010000  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .relocate     00000098  20000000  00403cb8  00020000  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  2 .bss          00000704  20000098  00403d50  00020098  2**2
                  ALLOC
  3 .stack        00002400  20000708  004043c0  00020098  2**3
                  ALLOC

The stack has a fixed size of 9 kiB (0x00002400), and with the .bss (0x00000704) we arrive at 0x20002b04, the initial stack pointer.

The stack sized is fixed in libboard/common/source/board_cstartup_gnu.c:

/* Stack Configuration */
#define STACK_SIZE       0x900     /** Stack size (in DWords) */
__attribute__ ((aligned(8),section(".stack")))
uint32_t pdwStack[STACK_SIZE] ;

and the (Atmel)linker script puts it just after the bss (instead of as usual putting it at the end of the RAM). ./libboard/common/resources/sam3s4/flash.ld :

    /* .bss section which is used for uninitialized data */
    .bss (NOLOAD) :
    {
        . = ALIGN(4);
        _sbss = . ;
        _szero = .;
        *(.bss .bss.*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = . ;
        _ezero = .;
    } > ram

    /* stack section */
    .stack (NOLOAD):
    {
        . = ALIGN(8);
        *(.stack .stack.*)
    } > ram

    . = ALIGN(4);
    _end = . ;

Part 5 - stack initialization
-----------------------------

If you paid attention, you might have noticed that msp = 0x20002b04 was at the start of the main application (qmod-cardem-dfu.elf), but the sections I showed (using objdump) were for the bootloader (bin/qmod-dfu-flash.elf).

If you look at the sections of the main application:

$ objdump -h bin/qmod-cardem-dfu.elf

bin/qmod-cardem-dfu.elf:     file format elf32-little

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00006d20  00404000  00404000  00004000  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .relocate     0000065c  20000020  0040ad20  00010020  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  2 .bss          00002278  2000067c  0040b37c  0001067c  2**2
                  ALLOC
  3 .stack        00002400  200028f8  0040d5f8  0001067c  2**3
                  ALLOC

The stack has has the same size, but .bss is must larger (00002278).
Thus the stack only needed to grow ~ 2 kiB (0x0002b04 - 0x00002278) reach bss, which is very likely (and happens every time even during early boot).

But why did the main application used the initial stack pointer of the bootloader?

Well the stack pointer is the first work of the binary and loaded at boot:

xxd bin/qmod-dfu-flash.bin | head
00000000: 042b 0020 a522 4000 7d11 4000 0913 4000  .+. ."@.}.@...@.
00000010: 1d13 4000 2d13 4000 3d13 4000 0000 0000  ..@.-.@.=.@.....
00000020: 0000 0000 0000 0000 0000 0000 4d13 4000  ............M.@.

The second word is a pointer to the vector table (starting with the reset vector).
When DFU start the main application, it only starts the reset vector.
libboard/common/source/board_cstartup_gnu.c :

static void BootIntoApp(void)
{
    unsigned int *pSrc;
    void (*appReset)(void);

    pSrc = (unsigned int *) ((unsigned char *)IFLASH_ADDR + BOARD_DFU_BOOT_SIZE);
    SCB->VTOR = ((unsigned int)(pSrc)) | (0x0 << 7);
    appReset = (void(*)(void))pSrc[1];

    g_dfu->state = DFU_STATE_appIDLE;

    appReset();
}

But this does not (re)set the stack pointer. Thus the old stack pointer (from DFU, with much less bss) was kept.

The main application stack pointer is also located at the beginning of the image:

$ xxd bin/qmod-cardem-dfu.bin | head
00000000: f44c 0020 696c 4000 6158 4000 ed59 4000  .L. il@.aX@..Y@.
00000010: 015a 4000 115a 4000 215a 4000 0000 0000  .Z@..Z@.!Z@.....
00000020: 0000 0000 0000 0000 0000 0000 315a 4000  ............1Z@.

So before starting the main application, we need to ensure that the MSP has this initial value:

__asm__ volatile ("MSR msp,%0" : :"r"(*pSrc));

Part 6 - remsim
---------------

With that fixed, the main image was booting and running fine.

=============================================================================
SIMtrace2 firmware 0.4.93-a952 (C) 2010-2016 by Harald Welte                 
=============================================================================
-I- Chip ID: 0x28900960 (Ext 0x00000000)                                     
-I- Serial Nr. 51203120-38463850-33303231-36313035
-I- Reset Cause: 0x0                              
-I- Detected Quad-Modem ST12
-I- VERSION_DET ADC=3727 => 3000235 uVund
-I- USB init...
USBD_Init      
Std gDesc Dev Std sAddr SetAddr(15) Std gDesc Dev Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc Qua -W- Sta 0x8028 [0] -W- _ Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc Cfg Std gDesc Cfg Std gDesc Str0 Std gDesc Str2 Std gDesc Str1 Std sCfg SetCfg(1) cfgChanged1 Std gDesc Str5 Std gDesc Str7 Std gDesc Str8 -I- calling configure of all configurations...
-I- calling init of config 1...                                                                 
-I- 536881924: CLK activateda work-around
-I- 536881924: CLK activated
-I- Modem 0: physical SIM
-I- 0: Use local/physical SIM
-I- Modem 1: physical SIM
-I- 1: Use local/physical SIM
-I- entering main loop...
-I- 0: Card S-1t-i       
                  Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std gDesc Str238 -I- 1: VCC activated
-I- 1: CLK activated
-I- 1: CLK deactivated
-I- 1: VCC deactivated
-I- 1: VCC activated
...
-I- 0: VCC deactivated

and running simtrace2-remsim did not produce any error:

$ sudo LD_LIBRARY_PATH=/usr/local/lib ./simtrace2-remsim --usb-vendor 1d50 --usb-product 4004 --usb-address 10 --usb-interface 0 --usb-config 1
simtrace2-remsim - Remote SIM card forwarding
(C) 2010-2017 by Harald Welte <laforge@gnumonks.org>

SCardEstablishContext: OK

SCardListReaders: OK

SCardConnect: OK

<- 01 05 00 00 00 00 09 00 01 
<- 02 02 00 00 00 00 09 00 01 
<= cardem_request_set_atr(3b 9f 96 80 1f c7 80 31 a0 73 be 21 13 67 43 20 07 18 00 00 01 a5 )
<- 01 02 00 00 00 00 1f 00 16 3b 9f 00 80 1f c7 80 31 a0 73 be 21 13 67 43 20 07 18 00 00 01 a5 
<- 02 01 00 00 00 00 0b 00 02 2c 01 
Entering main loop
^C<- 01 05 00 00 00 00 09 00 00 

serial output:

Std gDesc Str0 Std gDesc Str1 Std gDesc Str0 Std gDesc Str2 Std gDesc Str0 Std gDesc Str5 Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std gDesc Qua -W- Sta 0x888A8 [0] -W- _ Std gDesc -W- USBDDriver_GetDescriptor: Unknown descriptor type (10)
-W- Sta 0x888A8 [0] -W- _ Std gSta Dev Std gDesc Str0 Std gDesc Str7 Std gDesc Str0 Std gDesc Str8 Std sInterface -W- Sta 0x8880C [0] -W- _ Std cFeat Hlt Std cFeat Hlt Std cFeat Hlt -I- 0: skipping unsupported card_insert to INSERTED
-I- Modem 0: virtual SIM
-I- 0: Use remote/emulated SIM
-I- 0: Asserting modem reset
-I- 0: De-asserting modem reset

I don't know what the expected output should look like though.

WARNING: to fix the issue, DFU has to be reflashed on all qmod boards.

#6 Updated by laforge 9 days ago

Amazing, thanks a lot for your diligence in hunting this down, and for your detailed report.

As I'm about to leave for my holidays, I could only quickly sweep it, but for sure I'll re-read
it in all details as soon as I find some time.

Regards,
Harald

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)