
Things you should know about the stack on Cortex-M

The stack is one of the "invisible" parts of almost every C implementation. It usually comes into view only when something goes wrong. On a PC with an operating system, in the "good" bad case the application crashes in an obvious way, but stack issues can also lead to security vulnerabilities. On bare-metal embedded systems stack overflows can manifest themselves as "silent" data corruption (in the worst case) or obvious crashes/resets (in the "good" case). I will focus on the bare-metal Cortex-M use case.

Understanding stack usage is also crucial to knowing how much "free" RAM is left in an embedded system.

This article applies mostly to Cortex-M but some techniques may also be relevant for other architectures.

What is the stack? What is on the stack?

The stack is a convenient place to store local variables, some function arguments and return addresses. It is not mandated by the C standard but every C implementation I have worked with uses a stack. The compiler usually tries very hard to keep as many variables as possible in CPU registers for best performance, so not all local variables end up on the stack. A typical example is a loop index variable.

Static variables, on the other hand, have fixed (static) addresses assigned at link time. Most C implementations (let me know if you know one that does not) keep static variables in two sections called bss and data. The bss section holds variables that are initialized to zero at startup and the data section holds the ones that are initialized to any other value.

The exact usage of the stack is part of the platform's ABI.

Before diving deeper let's start with some trivial examples of what goes where.

Example 1

int GLOBAL_foo; //static variable (external linkage), goes to the bss section
int GLOBAL_bar = 5; //static variable (external linkage), goes to the data section

void my_func1(void){
   int my_array[16]; //automatic variable, goes onto the stack
   for (int i = 0; i < 16; i++){  //automatic variable, most likely kept in a register
       my_array[i] = i * 2;
   }
}

Example 2

int my_func2(int a, int b){ //a & b most likely passed in registers
    return a * b * 3; //most likely returned in a register
}

Example 3

This pattern should generally be avoided on bare metal because you do not know exactly what code it will compile to and what the ultimate memory usage will be. The compiler will most likely make the caller allocate space on the stack for the returned value and pass a pointer to that space to the callee.

A better pattern is to pass pointers to structs in an explicit way.

typedef struct { //"big" struct
    int x[10];
    int y[10];
} foo_t;

foo_t get_foo(int seed){ //seed most likely passed in registers
    foo_t f; //stored on the stack

    for (int i = 0; i < 10; i++){
        f.x[i] = seed * 2;
        f.y[i] = seed * 3;
    }
    return f; //most likely returned via the stack
}
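
For comparison, here is a minimal sketch of the explicit-pointer version of the same function. The fill_foo name and the calling code are my own, only for illustration; the point is that the caller owns the memory and the cost is visible at the call site:

void fill_foo(foo_t *f, int seed){ //the caller provides the memory, only a pointer is passed
    for (int i = 0; i < 10; i++){
        f->x[i] = seed * 2;
        f->y[i] = seed * 3;
    }
}

void caller(void){
    static foo_t f; //statically allocated - does not use the stack at all
    fill_foo(&f, 42);
}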

Example 4

You have function main() calling function foo(), which calls function bar(), which calls function baz(), but foo() can also call baz().

void baz(void){
    //do something
    return; //where should this return return to?
}
void bar(void){
    baz();
    //do something
    return;
}
extern int magic; //some flag defined elsewhere
void foo(void){
    if (magic){
        bar();
    } else {
        baz();
    }
    //do something more
    return;
}
void main(void){
    foo();
}

On every function call there will be some "housekeeping" data stored on the stack so that the CPU knows where to continue from once the called function returns.

The compiler (and linker) tries to optimize function calls by inlining so not every function call (from the C source) will actually use the stack.

General rule: the more nested function calls, the higher the stack usage. Some MCUs (8-bit PICs, for example) have a hardware stack with a fixed depth, so you may not even be able to call functions "too deep", or there will be a performance penalty due to extra code that the compiler has to insert. Cortex-M does not have such limitations.

How to avoid stack overflows

The best way to avoid a stack overflow is to not put too much data on the stack and rely on static allocation instead. Static memory allocation always gives a definitive answer to how much memory is used. In the context of embedded systems you can, for example, allocate the buffer for text that goes to a display based on how much text can physically fit on the screen.
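
As a trivial sketch of that idea (the display dimensions below are an assumption, not taken from a real project):

#define DISPLAY_COLUMNS 20 //a hypothetical 20x4 character LCD
#define DISPLAY_ROWS     4

//Statically allocated text buffer - its size is known at link time and never changes
static char display_text[DISPLAY_ROWS][DISPLAY_COLUMNS + 1]; //+1 for the NUL terminator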

Some tips:

  1. Avoid large local arrays.
  2. Especially avoid local variable length arrays.
  3. Do not pass structs as arguments. Pass pointers to structs instead.
  4. Do not return structs directly. Instead make the caller allocate the memory and only pass a pointer to where the result should be stored.
  5. Avoid deep nesting of functions.
  6. Avoid recursion.
  7. If you have to use recursion, limit its depth.

The only example I can think of where recursion is useful is serialization of a tree or similar structure. Imagine:

typedef struct node { //minimal node type, just for illustration
    int data;
    struct node *branch_a;
    struct node *branch_b;
} node_t;

void save_the_tree(node_t *node_ptr){

    //save node data...

    if (node_ptr->branch_a){
        save_the_tree(node_ptr->branch_a);
    }
    if (node_ptr->branch_b){
        save_the_tree(node_ptr->branch_b);
    }
}

The code is very easy to understand as it descends down the tree. Of course every call uses more and more stack. You may have a general idea how deep the tree can possibly go, but the next person working on the project may not, so the recursion should be explicitly limited. Example:

void save_the_tree(node_t *node_ptr){
    static uint32_t nesting_level = 0; //not reentrant, which is fine for single-threaded bare metal
    nesting_level++;
    if (nesting_level > 10){
        __BKPT(1); //nesting is too deep! (__BKPT is a CMSIS intrinsic)
    }

    //save node data...

    if (node_ptr->branch_a){
        save_the_tree(node_ptr->branch_a);
    }
    if (node_ptr->branch_b){
        save_the_tree(node_ptr->branch_b);
    }
    nesting_level--;
}

Typical MCU memory map

Modern small and "medium" MCUs have RAM & flash. Cortex-M has a single address space so the memories are distinguished only by address. There are no different instructions to access RAM, flash, peripheral registers etc. Here is an example from EFM32HG:

EFM32 memory map

You can see where the flash (labelled "code") and the RAM are. This particular MCU has at most 64 KB of flash so the physically implemented address space for code actually ends at 0xFFFF.

Because the stack is constantly changing at runtime, it is of course located in RAM. Somewhere in RAM.

Note: there are two (totally orthogonal) conventions for drawing memory maps:

  • the highest address on top
  • the lowest address on top

This applies to documentation, linker maps and debugger windows. I will use "up"/"upwards" to mean "the address is incrementing" and "down"/"downwards" to mean "the address is decrementing", even though the highest address may be displayed at the bottom.

The stack can theoretically grow in either direction (upwards or downwards). Cortex-M, however, uses a descending stack, which means that it grows downwards. It starts at some (high) address and the more data is pushed onto the stack, the lower the stack pointer gets. Cortex-M has a dedicated hardware register, SP, that always holds the current stack pointer.
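
A minimal sketch to observe this in practice, assuming a CMSIS environment (the __get_MSP() intrinsic returns the current main stack pointer; the include name is a placeholder for your part's device header):

#include <stdint.h>
#include "mcu_device.h" //placeholder: your part's CMSIS device header, which provides __get_MSP()

static uint32_t __attribute__((noinline)) callee(void){
    volatile uint32_t filler[16]; //force some stack usage
    filler[0] = 0;
    return __get_MSP(); //stack pointer inside the nested call
}

void observe_descending_stack(void){
    uint32_t sp_outer = __get_MSP();
    uint32_t sp_inner = callee();
    //sp_inner is lower than sp_outer: the stack grew towards lower addresses
    (void)sp_outer;
    (void)sp_inner;
}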

Where is the stack?

Wikipedia has this classic C application memory layout diagram:

C application layout

Let's skip the text section for now, as on Cortex-M it will usually be stored in flash, and let's also skip the heap (the place where memory allocated with malloc() and released with free() comes from), as it should generally be avoided in bare-metal systems.

On an MCU without virtual memory the data and bss sections are usually placed at the beginning of RAM (ie. the lowest RAM address) and have fixed sizes that are determined at link time. The stack is placed at the top of RAM and grows downwards at runtime. If the heap is not used, the memory between the stack and data/bss can generally be regarded as "free RAM".

What happens when the stack grows too far? It "hits" the static data sections and overwrites data that has been stored there. Suddenly some of your variables simply have wrong values and the application can't work reliably anymore.

As the stack changes at runtime there is no universal way to predict maximum stack usage. This would require solving the halting problem. There are however several "good enough" approaches to stack size management.

I would split the methods into two general groups:

  • Measurement of the actual stack usage
  • Making sure that a certain stack size limit has not been exceeded

Some are more useful during development, some should be used "in the field" to spot a stack overflow and panic in a predictable way.

Stack section in linker scripts

Many Cortex-M linker scripts have a section called "stack" or similar. This is often used with IDEs that can "set the stack size in a project".

It does not set the stack size, nor does it prevent stack overflows in any way. It only covers the situation where you have, for example, 8 KB of RAM, know that 1 KB for the stack will be okay, and inadvertently use 7.5 KB for static variables. In that case linking will fail and you will get an error message saying that the sections are too big to fit into memory.

Note about interrupts

Interrupts on Cortex-M behave just like regular functions. They just tend to happen "out of the blue". All local variables in the interrupt handler will of course use the stack. Additionally, the hardware will save part of the context on the stack before executing the handler.

That context consists of the caller-saved core registers (R0-R3, R12, LR, PC and xPSR); the handler code itself stacks any further registers it needs. If the device has a floating-point unit and the interrupt handler uses floating-point math, the FPU registers have to be stacked as well. By default lazy stacking is used: space for the FPU registers is reserved on the stack, but their values are only actually saved when the handler touches the FPU. This behaviour is controlled via the FPCCR register in the System Control Space.

Cortex-M interrupts can be nested. If two interrupts have different priorities assigned the lower priority interrupt handler can be interrupted by a higher priority interrupt handler. This is what the N in NVIC stands for. In that case the stack will hold the data for the main code, the lower priority interrupt handler and the higher priority interrupt handler.

MCUs often implement many interrupt priority levels, so you could have, for example, four nested interrupts, but that is usually bad design practice. Interrupt handlers should be kept short so that they do not have to be interrupted to meet overall system response timings.

Main and process stack pointers are explained in the RTOS section at the end of this article.

Bare metal techniques

Compiler analysis with -fstack-usage

The stack is managed by some flavor of PUSH and POP CPU instructions. The compiler generates these instructions when needed. It has to generate code that will properly allocate all the local variables so the compiler must know how much stack to use, right? Yes! To a certain extent. 🙂

GCC supports building code with -fstack-usage, which generates extra .su files with per-function stack usage information. These files can then be fed to a post-processing script that generates stack usage statistics.

This technique basically works. To estimate the peak stack usage you have to sum the stack consumption along the deepest call chain (built from the call graph) and add the worst-case consumption of the ISRs that can preempt it.
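
As a rough illustration (the file name, function and numbers below are made up, and the exact figures depend on compiler version and optimization level):

//Compile with: arm-none-eabi-gcc -c -fstack-usage demo.c
//The generated demo.su contains one line per function, for example:
//  demo.c:5:6:fill_buffer	72	static
//"static" means the usage is a fixed, known amount; VLAs or alloca() show up as "dynamic".
void fill_buffer(void){
    char buf[64]; //local array that ends up on the stack
    for (int i = 0; i < 64; i++){
        buf[i] = (char)i;
    }
}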

What are the edge cases? -fstack-usage cannot follow function pointers and callbacks. GCC still reports the per-function usage, but a call-graph-based post-processing script cannot know in advance which function a pointer will point to at runtime.

Observing the stack with a debugger

This is the easiest technique, as it requires no firmware support at all. It can be used on any architecture, not only Cortex-M. Let's start by flashing the firmware, keeping the CPU halted and filling all of RAM with a single value. Filling RAM (from the beginning at 0x20000000) in Ozone looks like this:

Filling RAM in Ozone

All RAM locations should have the same value after filling:

RAM after filling

Now let the firmware run for a while: send some data to the embedded system, press buttons etc. Then halt the CPU and observe the top of RAM. The 0xAA pattern has been overwritten by the stack:

Stack after run

You can see the stack usage by looking at the first memory location that still has the pre-filled value.

Scrolling the memory viewer towards the beginning of the RAM you can also spot where the data/bss sections end (they were initialized by the startup code and no longer hold the fill pattern).

Static data in RAM

All RAM that still holds the AA pattern is essentially unused.

Self-checking (pre-fill)

The same approach can be recreated purely in firmware (again, on any architecture). The firmware can fill the "free space" in a loop with a known pattern and then periodically look for the last occurrence of the pattern when counting upwards from bss, or for the first occurrence when counting downwards from the top of RAM.

The initial fill loop needs to know where to start and where to stop (so it does not overwrite its own stack). The end of bss can easily be found in the symbols emitted by the linker, for example _ebss or __bss_end (check your linker script for similar names). The symbol can be declared in C code like a regular variable: extern const uint32_t _ebss;. The end of the bss section is then simply &_ebss. You will certainly find these symbols and a similar loop somewhere in the startup code that initializes bss to zero.

The second problem is how to find the other end of the "free" RAM. A simple way is to pre-fill the RAM with the debugger, run the startup and init code up to a breakpoint and see how much stack this early code has actually used. For example, if the startup code used 72 bytes of stack, you can hardcode the fill to stop approximately 128 bytes below the top of RAM. Regular stack usage will be larger, so the initial fill does not have to cover every unused byte.

Once in a while the code can loop over the RAM locations and store the peak stack usage in a variable that can later be sent over UART or accessed through a different interface. This technique is useful in the field because you can deliver the device to the customer, let it run for days or weeks in realistic conditions and then read out the stack usage. The runtime overhead is also small, so I would keep this instrumentation in production releases.
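
A minimal sketch of this technique, assuming GNU-style linker symbols (_ebss and _estack - adjust the names to your linker script) and an arbitrary fill pattern and safety margin:

#include <stdint.h>

extern uint32_t _ebss;   //first word after the bss section (linker symbol)
extern uint32_t _estack; //initial stack pointer / top of RAM (linker symbol)

#define FILL_PATTERN  0xAAAAAAAAu
#define MARGIN_WORDS  128u //words left untouched below the top of RAM (already used by startup + this code)

//Call this very early, before the stack has had a chance to grow
void stack_prefill(void){
    uint32_t *p = &_ebss;
    uint32_t *end = &_estack - MARGIN_WORDS;
    while (p < end){
        *p++ = FILL_PATTERN;
    }
}

//Bytes between the end of bss and the deepest stack usage seen so far (RAM that was never touched)
uint32_t stack_margin_bytes(void){
    uint32_t *p = &_ebss;
    uint32_t *end = &_estack - MARGIN_WORDS;
    uint32_t margin = 0;
    while ((p < end) && (*p == FILL_PATTERN)){
        margin += sizeof(uint32_t);
        p++;
    }
    return margin;
}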

This technique does not offer any protection from a stack overflow once it happens. It may also fail if your application crashes or goes crazy before the check.

Self-checking (canary)

Instead of filling the whole free RAM and looping to find the overwritten addresses, you can simply place a known value at &_ebss (maybe one word above it, watch out for an off-by-one) and keep checking it once in a while. If the magic value is still at the end of the bss section it means that it has never been touched by the growing stack. Simple. 🙂

Another variation can be the placement of several such canaries (for example every 1 KB) and checking them at runtime.
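
A minimal sketch of a single canary, assuming the _ebss linker symbol and an arbitrary magic value:

#include <stdint.h>

extern uint32_t _ebss; //first word after the bss section (linker symbol, name depends on your script)

#define STACK_CANARY 0xDEADBEEFu //any value unlikely to occur by accident

void stack_canary_init(void){
    _ebss = STACK_CANARY; //write the magic value to the word right above bss
}

//Call this periodically (eg. from the main loop or a timer)
void stack_canary_check(void){
    if (_ebss != STACK_CANARY){
        //the stack has grown all the way down to bss - leave breadcrumbs, reset, etc.
        for (;;);
    }
}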

This technique does not offer any protection from a stack overflow once it happens. It may also fail if your application crashes or goes crazy before the check.

Self-checking (data watchpoint and trace unit)

Cortex-M3 and larger cores have a data watchpoint and trace unit (DWT) that is accessible from firmware. Firmware can use it to set watchpoints on chosen memory locations. When such a location is written, the CPU immediately takes the DebugMon exception, which can leave debugging breadcrumbs for analysis, make the system safe and reboot.

The DWT can be set to watch for writes to &_ebss. This technique is 100% reliable and will catch every stack overflow.
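
A minimal sketch of arming such a watchpoint, assuming CMSIS register definitions for an ARMv7-M core and the _ebss linker symbol (check the FUNCTION encoding against your core's reference manual; note that the DebugMon exception is only taken when no halting debugger is attached - with a debugger connected the core halts instead):

#include <stdint.h>
#include "mcu_device.h" //placeholder: your part's CMSIS device header

extern uint32_t _ebss; //first word after the bss section (linker symbol)

void stack_watchpoint_init(void){
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk   //enable the DWT
                      | CoreDebug_DEMCR_MON_EN_Msk;  //enable the DebugMon exception
    DWT->COMP0     = (uint32_t)&_ebss; //address to watch
    DWT->MASK0     = 0;                //compare the full address (a single word)
    DWT->FUNCTION0 = 6;                //0b0110 = generate a watchpoint event on write (ARMv7-M)
}

void DebugMon_Handler(void){
    //the stack has reached the watched word - leave breadcrumbs, make the system safe, reboot
    for (;;);
}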

Self-checking (MPU)

The memory protection unit can be used in a similar way to the DWT mentioned above. MPUs operate on aligned, power-of-two-sized regions of memory (down to 32 bytes on the standard Cortex-M3/M4 MPU), so the protected region has to be laid out with some care.

The MPU can, for example, be configured with a 1 KB "red zone" right above bss so that any write access to that region triggers an exception. The exception handler can then leave debugging breadcrumbs (just like in the DWT case).
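
A minimal sketch of such a red zone, assuming CMSIS register definitions for an ARMv7-M core; the _stack_guard symbol is a made-up, 1 KB-aligned guard array that you would have to place right above bss in the linker script:

#include <stdint.h>
#include "mcu_device.h" //placeholder: your part's CMSIS device header

extern uint32_t _stack_guard[256]; //1 KB guard region, aligned to 1 KB, placed right above bss

void stack_redzone_init(void){
    MPU->RNR  = 0;                             //configure MPU region 0
    MPU->RBAR = (uint32_t)_stack_guard;        //region base address (must be aligned to the region size)
    MPU->RASR = (9u << MPU_RASR_SIZE_Pos)      //size = 2^(9+1) = 1 KB
              | (0u << MPU_RASR_AP_Pos)        //AP = 000: no access for anyone
              | MPU_RASR_XN_Msk                //no instruction fetches from the region either
              | MPU_RASR_ENABLE_Msk;
    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;   //enable the MemManage fault handler
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk        //keep the default memory map for everything else
              | MPU_CTRL_ENABLE_Msk;
    __DSB();
    __ISB();
}

void MemManage_Handler(void){
    //the stack has grown into the red zone - leave breadcrumbs, make the system safe, reboot
    for (;;);
}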

Alternatively, the MPU could be configured with a default "red zone" policy across the whole RAM and two smaller "green zone" policies for the data/bss and the stack.

This technique is 100% reliable and will catch every stack overflow.

"Inverted linker layout"

What if the data and bss sections were placed at the top of RAM and the stack below them? The stack could then grow until it went past the "bottom" of the RAM, triggering a bus fault exception.

Such a scheme is perfectly possible. The issue (at least with the GNU linker) is that the linker accepts only start addresses of sections, not end addresses. There is no way to tell the linker to "start data anywhere but make sure it ends before 0x20002000". This can be worked around by linking twice, using two different linker scripts and some scripting.

The first link is used to determine the exact data and bss sizes. They can be extracted using the nm or size commands, for example. A script then calculates how much RAM is left for the stack and launches the linker again. The second linker script lays out the sections in this order: stack, bss & data. The size of the stack section is passed on the command line, for example -Wl,--defsym,__stack_size=0x123245678.

Of course you can also reserve a fixed size for the stack and place it at the beginning of RAM. data and bss will then be placed above the stack and whatever is left will be "truly wasted" RAM.

This technique is 100% reliable and will catch every stack overflow.

RTOS techniques

An RTOS multiplies the stack issue by the number of tasks. 🙂 Every task has its own stack that has to be managed.

Cortex-M has two stack pointer registers - the main stack pointer (MSP) and the process stack pointer (PSP). By default the main stack pointer is used for everything (regular code and interrupts). The second stack pointer is very useful with an RTOS. Without it the ISRs would use the stack of the currently running task, so you would have to over-allocate every task stack to the largest size needed by the task plus the largest size ever needed by an ISR (plus nested ISRs, if used). That would be pretty wasteful, so the second stack pointer lets the ISRs use a completely independent stack. The vocabulary is pretty confusing, as the main stack pointer is always used in handler mode (ISRs) while the process stack pointer can optionally be used in thread mode.

In general: don't worry. FreeRTOS "just works". In a default setup the tasks have their respective stacks and ISRs have the stack that was used before calling the scheduler (ie. the one that main() used).

Self-checking

Most mainstream RTOSes have some kind of stack instrumentation. For example, FreeRTOS implements its own stack checking and measurement. Internally this is carried out using the self-checking pre-fill method described above or by checking the value of the task stack pointer when switching the task out.

If the stack overflows the RTOS calls vApplicationStackOverflowHook. The checks are carried out only during context switching, so an overflow can still damage the application. I have found that this feature works most of the time, but sometimes the CPU goes straight into the hard fault handler if the overflow is so severe that it also corrupts the RTOS runtime data and the RTOS cannot execute its checking anymore.
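
A minimal sketch of the FreeRTOS side, assuming configCHECK_FOR_STACK_OVERFLOW is set to 1 or 2 and INCLUDE_uxTaskGetStackHighWaterMark is set to 1 in FreeRTOSConfig.h (the hook signature shown matches recent FreeRTOS versions):

#include "FreeRTOS.h"
#include "task.h"

//Called by the kernel when it detects an overflow during a context switch
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName){
    (void)xTask;
    (void)pcTaskName;
    taskDISABLE_INTERRUPTS();
    //leave breadcrumbs, make the system safe, reboot
    for (;;);
}

//Measurement without waiting for an overflow: the "high water mark" is the smallest amount
//of stack (in words) that has ever been left unused for the given task
void log_stack_headroom(TaskHandle_t task){
    UBaseType_t words_left = uxTaskGetStackHighWaterMark(task);
    (void)words_left; //send over UART, store in a log, etc.
}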

My "scientific" method when I get an inexplicable crash with FreeRTOS is to first increase task stack sizes to see if the problem goes away or if the RTOS manages to detect the overflow.

DWT & canaries

Apart from the RTOS facilities, all the generic methods still apply. For example, if you have 3 tasks you can allocate the stacks statically in regular C arrays, placed in a struct for a predictable layout, with canaries in between:

typedef struct {
    uint32_t canary1;
    uint32_t stack_for_task_display[128];
    uint32_t canary2;
    uint32_t stack_for_task_network[128];
    uint32_t canary3;
    uint32_t stack_for_task_logging[128];
} my_magic_struct_t;

Canary values can be protected using the DWT method. Why only three canaries? The DWT has only a few data watchpoint comparators (the exact number is implementation-defined), so hardware watchpoints are a scarce resource.

This technique is 100% reliable and will catch every stack overflow.

MPU

Finally, the RTOS can use the MPU. The problem is that for best results the application has to be designed with the MPU in mind from day one.

Imagine tasks A & B, each with its own stack, its own static variables and a shared RTOS queue to exchange data. When task A is running the MPU only allows access to task A stack, task A static variables and the queue. The RTOS switches the tasks and reconfigures the MPU to allow access to task B stack, task B static variables and the queue. This is a very simple scenario. Putting task static variables into a struct (or a dedicated linker section) to keep them in contiguous memory is pretty easy. Now imagine that you have more tasks, more queues, more semaphores, and the relation between them is more complicated. And that the regions must be properly aligned. And that the Cortex-M3 MPU can protect only up to 8 memory regions.

FreeRTOS provides privileged and unprivileged tasks so you don't have to protect every possible pair of tasks against each other. For example, most tasks can run in privileged mode and access all the memory, while only some "risky" ones (eg. dealing with user input or data coming from the network) run in unprivileged mode.