Solving an embedded mystery in ten easy steps
Sometimes there are bugs that cause mysterious effects that seem to defy all logic, but since we're dealing with computers we know that there's always a logical explanation for everything that happens. If you're running on a desktop computer with a few gigs of code running at once, finding the explanation is sometimes too time-consuming, but on embedded devices you can usually figure out what's going on - one step at a time.
At work we're working on an embedded project based on a Cortex M3 ARM microcontroller (NXP's LPC1763) running FreeRTOS. In this project we don't have any external memory, and the internal 64kB has been a problematic constraint for a while. I've had stack overflows cause weird behaviour, memory corruption and leaks. Par for the course when developing C code.
Recently I updated to a newer version of our BSP to be able to run the code on my Mac using the POSIX implementation of the BSP. After getting that up and running it was time to update things on the ARM side, and this is where I ran into "interesting behaviour". "Interesting" is rarely a good thing when working on embedded projects…
I was getting stuck in task switches that didn't finish, hard faults in places that made no sense, ending up in bsp_epic_fail() or just spinning around in a single task without switching to the others.
This was likely due to either uninitialised memory or stack overflows. There's a helpful comment in the FreeRTOS code that says "if you get stuck here, you probably ran out of stack somewhere". Time to get investigating!
Step 1: Things are zero
Looking around a bit in the debugger while stopped in one of the impossible states, I noticed that things were zero that should have been non-zero after init.
Step 2: They’re set properly on boot
Double-checked that init is in fact getting called, and that values are set as expected during startup.
Step 3: BSP init code is clearing RAM twice
That means something is overwriting the values, but who? A watchpoint set on one of the variables in question showed that it was the BSP init code that cleared the values. Now, it is actually supposed to do this - but only on boot. There's a portion of RAM called BSS where the linker places variables that are supposed to start life as zeroes.
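As a minimal illustration of that distinction: a static without an initializer is the kind of variable that ends up in BSS, while one with a non-zero initializer typically goes to an initialised data section instead. The variable names are made up for this example, and where each variable actually lands depends on the toolchain and linker script.

```c
#include <stdint.h>

/* No initializer: C guarantees this starts as zero, and toolchains
 * typically place it in .bss, relying on startup code to clear it. */
static uint32_t g_boot_count;

/* Non-zero initializer: typically placed in .data and copied from
 * flash at startup rather than zeroed. */
static uint32_t g_retry_limit = 3;

uint32_t boot_count(void)  { return g_boot_count; }
uint32_t retry_limit(void) { return g_retry_limit; }
```

On a hosted system the zero comes for free from the loader; on a bare-metal target it is only true if the startup code really does clear the region the variable lives in.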
Step 4: Nothing is calling the init code; broken callback or function pointer?
Now we know what code is causing havoc, but how did we end up there? The CPU doesn't jump to code on its own initiative, so something has to tell it to go there. What could that be? Nothing in the code referenced the init function, so it had to be either a random value that happened to point to this code, or a NULL pointer that somehow ended up there.
Step 5: Serial processing code is calling init code
About here is when my colleague got involved, and we proceeded to determine what called the init code. I suspected that some interrupt handler might be uninitialised, since the errors seemed to occur after a few ms, but not acting exactly the same every time.
We divided and conquered the problem by setting breakpoints at suspected locations and stepping until we crashed, narrowing it down a bit at a time. Soon we knew that the serial data task was the cause, due to a pointer being NULL that couldn't possibly be NULL.
Step 6: Because a pointer that cannot be NULL actually is
Here's the offending code, notice the missing NULL check after allocating the channel.
channel_t *channel = channel_new(...);
while (true)
{
    if (bsp_uart_is_data_pending(UPSTREAM_UART))
    {
        uint8_t b;
        while (bsp_uart_read(DSS_UART, &b, 1) == 1)
        {
            serial_channel_add_data(channel, b, bsp_timer_start());
        }
    }
}
The reason for the missing NULL check is that we know that channel_new() can't fail at this point; we've allocated space for as many channels as we'll ever use. But still we get a NULL.
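For reference, here is a rough sketch of what a cheap guard could look like, even for an allocation that "cannot fail". Everything in it is hypothetical: require_channel(), the hook type and the counting hook are stand-ins invented for this example, and on the target the hook could be something like bsp_epic_fail().

```c
#include <stddef.h>

/* Placeholder type; the real channel_t has more fields. */
typedef struct channel { int dummy; } channel_t;

/* Hypothetical fatal-error hook type. */
typedef void (*fatal_hook_t)(void);

/* Example hook that only counts calls, so the guard can be exercised
 * on a desktop machine. */
static int g_fatal_calls;
static void count_fatal(void) { g_fatal_calls++; }

/* Returns the channel unchanged. If it is NULL, the hook is invoked
 * first so the failure is loud instead of silent. */
static channel_t *require_channel(channel_t *channel, fatal_hook_t on_fatal)
{
    if (channel == NULL && on_fatal != NULL) {
        on_fatal();
    }
    return channel;
}
```

Wrapping the allocation as require_channel(channel_new(...), hook) would have stopped the task right at the allocation instead of several function calls later.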
Step 7: It is NULL because something else that should be zero isn't
Looking into the channel_new() code we see that there is indeed no way we could end up with a NULL, because we know that .in_use is false on boot (all uninitialised global variables are set to zero in C).
static channel_t g_channels[NUM_CHANNELS];
channel_t *channel_new(...)
{
    channel_t *channel = NULL;
    for (int i=0; i<NUM_CHANNELS; i++)
    {
        channel_t *ch = &g_channels[i];
        if (ch->in_use)
            continue;
        channel = ch;
    }
    if (!channel)
        return NULL;
    …
}
But a breakpoint in channel_new() showed that when the function is called, g_channels isn't zeroed out - it's filled with random data. But that's not possible, the BSP init code zeroes out everything in BSS - it's because it's good at doing that we're here looking for a bug!
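The effect is easy to reproduce on a desktop machine. The sketch below is a simplified stand-in, not the project's code: channel_t is reduced to a single byte-sized in_use flag, NUM_CHANNELS is picked arbitrarily, and the two simulate_* helpers are invented to mimic RAM that was or wasn't cleared at startup.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_CHANNELS 4   /* arbitrary value for the example */

typedef struct {
    uint8_t in_use;      /* assumption: a plain byte flag */
} channel_t;

static channel_t g_channels[NUM_CHANNELS];

/* Simplified allocator: hands out the first slot not marked in use. */
static channel_t *channel_new(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++) {
        channel_t *ch = &g_channels[i];
        if (ch->in_use)
            continue;
        ch->in_use = 1;
        return ch;
    }
    return NULL;
}

/* Mimics RAM that the startup code never cleared. */
static void simulate_uncleared_ram(void)
{
    memset(g_channels, 0xA5, sizeof g_channels);
}

/* Mimics what the BSS init loop is supposed to have done. */
static void simulate_bss_clear(void)
{
    memset(g_channels, 0, sizeof g_channels);
}
```

With garbage in the array every slot looks taken, so the very first allocation fails even though nothing has been allocated yet.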
Step 8: It isn't zero because it isn't covered by BSS init code
Looks like we need to have a look at what the RAM init code is actually doing:
extern char chipZeroStart; // RAM, start of zeroed statics
extern char chipZeroEnd;   // end of same, +1
// Clear uninitialized statics (.bss section)
for (char *target=&chipZeroStart; target<&chipZeroEnd;) {
    *target++ = 0;
}
Besides being a bit inefficient due to writing a byte at a time instead of writing words, it looks like it's doing what it's supposed to. But how does it know where the BSS section starts and ends? The chipZeroStart/End variables are provided by the linker, after it has placed the relevant variables.
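As an aside on the efficiency remark, a word-at-a-time version might look something like the sketch below. clear_words() is a name invented here, and it assumes both ends of the region are 4-byte aligned, which is an assumption that has to be guaranteed by the linker script.

```c
#include <stdint.h>

/* Clears [start, end) one 32-bit word at a time.
 * Assumes both pointers are 4-byte aligned and that end >= start. */
void clear_words(uint32_t *start, const uint32_t *end)
{
    for (uint32_t *target = start; target < end; target++) {
        *target = 0;
    }
}
```

On the target the linker-provided symbols would be declared as uint32_t instead of char and passed in by address. It makes no difference to the bug at hand, since a faster loop still only clears what lies between the two symbols.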
The linker script looks like this:
SECTIONS {
    ...
    .bss : {
        chipZeroStart = .;
        *(.bss .bss.*) /* zero-filled static data (non-const) */
        . = ALIGN(4);
        chipZeroEnd = .;
    }
}
It sets the chipZeroStart variable to the address where the BSS section starts, adds all the sections named .bss*, aligns to word size and then puts the current address into chipZeroEnd. We know that our variables should go into .bss.something sections, because we're using the -fdata-sections parameter when running gcc.
Side note: linker scripts are insanely easy to get wrong, and the most fragile piece of code in any project.
Step 9: It isn't covered because it isn't in a section starting with ".bss"
Even though we know that things are set up properly, we'll just check the linker map file to make sure that - hey, g_channels is in section COMMON. And COMMON is placed after chipZeroEnd! This means that g_channels isn't being cleared by the init code. Which means that channel_new() is returning NULL, which causes the serial task to call the init code.
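Reading the map file is what found it, but the same condition can also be checked in code. The helper below is a hypothetical sketch (is_zeroed_at_boot() is not part of the BSP): it reports whether an object lies entirely inside the region the init loop clears.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the object [obj, obj + size) lies entirely inside
 * [zero_start, zero_end). Addresses are compared as integers so that
 * pointers into unrelated objects can be compared safely. */
bool is_zeroed_at_boot(const void *obj, size_t size,
                       const void *zero_start, const void *zero_end)
{
    uintptr_t first = (uintptr_t)obj;
    uintptr_t start = (uintptr_t)zero_start;
    uintptr_t end   = (uintptr_t)zero_end;

    if (first < start || first > end)
        return false;
    return size <= end - first;
}
```

On the target, a debug build could call is_zeroed_at_boot(g_channels, sizeof g_channels, &chipZeroStart, &chipZeroEnd) during init and fail loudly if it returns false.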
Step 10: Tweaked linker script, everything ok
We added COMMON to the sections added to BSS, which made sure that g_channels gets cleared and that channel_new() succeeds. There's a flag you can give to gcc, -fno-common, that will tell it to not place things in COMMON, but we still needed the linker script fix because a couple of variables from our stdlib were also in COMMON.
Tying up some loose ends
But we’re forgetting something - how did we end up in the init code at all? There were no references to it from the serial task code, so it shouldn’t be possible. To understand this, we need to know a little something about memory on embedded systems.
On ARM microcontrollers there's usually nothing special about NULL; it's just a pointer that points to address zero, and that's usually the beginning of Flash, where your code is. On the Cortex M series the interrupt vector table is placed at address zero, and it contains a bunch of function pointers:
Addr  Value
0     initial stack pointer
4     init function
8     ...
When running things on a larger machine with an MMU you're probably going to have a special "not touching allowed" page mapped to NULL, that will cause a segfault if you try to even read from it. Not so on small embedded devices. Here the read will succeed and your code will continue running. In our case, this is how we managed to get the address to the init code. Notice that the second value above is a pointer to the init function. Now look at our serial handling code:
struct channel_t {
    uint32_t something;
    void (*add_data)(void *, uint8_t, uint32_t);
    …
};
serial_receive_result_t serial_channel_add_data(channel_t *ch, uint8_t byte, uint32_t timestamp)
{
    return ch->add_data(ch, byte, timestamp);
}
The serial task calls serial_channel_add_data() with ch set to NULL. Then serial_channel_add_data() calls the function pointer in channel, which is offset from the base address of the channel by 4 bytes. So if channel is NULL, it treats the value stored at NULL+4 as a function pointer - and this is where the pointer to our init code is.
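The overlap can be demonstrated without dereferencing NULL. The sketch below is only a model: vector_table_t is a hypothetical two-entry stand-in for the real vector table, fake_init() stands in for the init function, and channel_t is cut down to the two fields shown above. It also assumes that data and function pointers have the same size as uintptr_t, which holds on common ABIs but is not guaranteed by C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t something;
    void (*add_data)(void *, uint8_t, uint32_t);
} channel_t;

/* Hypothetical model of the first two vector table entries. */
typedef struct {
    void *initial_sp;
    void (*init_function)(void);
} vector_table_t;

void fake_init(void) {}

/* Raw bits that would be read as ch->add_data if ch pointed at table. */
uintptr_t add_data_bits_at(const vector_table_t *table)
{
    uintptr_t bits = 0;
    memcpy(&bits,
           (const unsigned char *)table + offsetof(channel_t, add_data),
           sizeof bits);
    return bits;
}

/* Raw bits of a function pointer, for comparison. */
uintptr_t function_bits(void (*fn)(void))
{
    uintptr_t bits = 0;
    memcpy(&bits, &fn, sizeof bits);
    return bits;
}
```

On the 32-bit Cortex M3 both offsets are 4. On a 64-bit desktop they are typically both 8, so the two layouts still line up and the bits read through the channel's add_data slot match those of the init function pointer.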
Mystery solved; as usual it’s all due to a series of minor details being wrong.