Debugging runtime memory corruption on Cortex-M
Runtime memory corruption is one of the worst classes of bugs a C/C++ application can have. I do not mean design problems like abuse of global variables, but seemingly correct code clobbering memory it should never touch (for example due to runaway pointers). Compared to "regular" crashes, which are obvious and much simpler to fix (even if they are rare, they leave a stack trace), memory corruption is often silent. It can go unnoticed for a long period and manifest itself in subtle ways, for example the application sometimes acts weirdly or a particular variable is sometimes wrong. Fortunately, Cortex-M3 and M4 cores are equipped with special hardware that can assist in catching rogue memory accesses.
An obvious approach is of course to use the data watchpoint feature of any decent debugger to catch the code that does the improper write. However, it is not always possible to keep the system running under debugger control on a desk for very long (if the corruption happens very sporadically), or the final device setup cannot be replicated (for example: it depends on a part of the customer's plant... which may be hard to fit into the office). In such cases the firmware itself needs to be instrumented enough to assist in debugging.
Memory protection unit
The MPU is a peripheral that is specifically designed to control access to various areas of the address space. When a protected area is accessed (read and/or write, depending on the MPU configuration) the MPU triggers a fault exception (MemManage on Cortex-M) whose handler has to decide what to do with the memory protection fault. An MPU is however not a universal solution. First of all, it is an optional peripheral, so it may not be present in the MCU you are using. The firmware must also be specifically architected to take advantage of the MPU from day one. MPU regions are also coarse: on Cortex-M3/M4 a region has a power-of-two size of at least 32 bytes and must be aligned to its size, so it is impractical to protect just a single variable.
If the firmware was not designed with the MPU in mind, restructuring it will change the placement of variables in memory, so instead of a known corruption of some variable(s), other data will be clobbered by the same buggy code. This may also make the initial problem untraceable, because clobbering of other variables may lead to rarer or more subtle symptoms.
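To make the granularity limit concrete, here is a minimal sketch of the region constraints on an ARMv7-M MPU (Cortex-M3/M4): the region size is encoded in the RASR SIZE field as 2^(SIZE+1) bytes with a minimum of 32 bytes, and the base address must be aligned to the region size. The helper names `mpu_size_field` and `mpu_region_valid` are made up for this illustration and are not part of any vendor library.

```c
#include <stdint.h>

/* Returns the RASR SIZE field value for a region of `bytes` bytes,
 * or -1 if the size is not a power of two or is smaller than 32 bytes.
 * Region size = 2^(SIZE + 1), so SIZE = log2(bytes) - 1. */
int mpu_size_field(uint32_t bytes)
{
    if (bytes < 32U || (bytes & (bytes - 1U)) != 0U) {
        return -1;
    }
    int log2_bytes = -1;
    while (bytes != 0U) {
        bytes >>= 1;
        log2_bytes++;
    }
    return log2_bytes - 1;
}

/* A region is usable only if its size is encodable and its base address
 * is aligned to that size. Returns 1 if valid, 0 otherwise. */
int mpu_region_valid(uint32_t base, uint32_t bytes)
{
    return mpu_size_field(bytes) >= 0 && (base & (bytes - 1U)) == 0U;
}
```

A lone 4-byte variable fails the size check, and even a 32-byte region only works if the linker places the data on a 32-byte boundary, which is why retrofitting the MPU tends to move variables around.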
Data watchpoint and trace unit
Cortex-M3 and M4 cores have a DWT unit. It can be used to set up breakpoints and watchpoints. A watchpoint (also called a data or memory breakpoint) is triggered when a particular address is read or written by the CPU. The DWT has 3 address comparators, which allows setting up to 3 watchpoints in total. When an address is matched the DWT simply triggers a DebugMon interrupt. Due to its simplicity it is the ideal tool to catch memory corruption at a late stage of firmware development.
Setting up watchpoints at runtime
The following code enables and disables a watchpoint for a uint32_t-type variable (i.e. the address is matched across all 32 bits), so it is useful for variables that occupy at least 4 bytes and are 4-byte aligned. For example: a uint32_t, or an array or struct of uint32_t. It is best to check the address of the variable (and adjacent variables) in the map file or debugger to be sure that the watchpoints will not be triggered when accessing adjacent variables.
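The original listing is not available here, so the following is only a sketch of what such enable/disable functions might look like, not the author's code. It assumes the ARMv7-M register layout (DWT comparators starting at 0xE0001020, DEMCR at 0xE000EDFC) and a write-only watchpoint. The names `dwt_comp_t`, `watch_enable` and `watch_disable` are hypothetical. The register block is passed in as a pointer so the logic can also be exercised off-target.

```c
#include <stdint.h>

/* One DWT comparator block as laid out on ARMv7-M: COMPn, MASKn, FUNCTIONn. */
typedef struct {
    volatile uint32_t COMP;      /* address to match */
    volatile uint32_t MASK;      /* number of low address bits to ignore */
    volatile uint32_t FUNCTION;  /* match action, 0 = disabled */
    uint32_t reserved;
} dwt_comp_t;

#define DEMCR_MON_EN   (1UL << 16)  /* enable the DebugMonitor exception */
#define DEMCR_TRCENA   (1UL << 24)  /* enable the DWT unit */
#define DWT_FUNC_OFF   0x0UL
#define DWT_FUNC_WRITE 0x6UL        /* watchpoint on write (0x7 = read/write) */
#define WATCH_COUNT    3U           /* comparators used, as in the text */

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Real register addresses, only meaningful on the target. */
#define WATCH_COMPS ((dwt_comp_t *)0xE0001020UL)
#define WATCH_DEMCR ((volatile uint32_t *)0xE000EDFCUL)
#endif

/* Arm watchpoint `n` on the 4-byte aligned word at `addr`.
 * Returns 0 on success, -1 on a bad index or unaligned address.
 * DebugMon only fires while no debugger has halting debug enabled. */
int watch_enable(dwt_comp_t *comps, volatile uint32_t *demcr,
                 unsigned n, const volatile void *addr)
{
    if (n >= WATCH_COUNT || ((uintptr_t)addr & 3U) != 0U) {
        return -1;
    }
    *demcr |= DEMCR_TRCENA | DEMCR_MON_EN;
    comps[n].COMP = (uint32_t)(uintptr_t)addr;
    comps[n].MASK = 2U;                    /* ignore 2 bits: match 4 bytes */
    comps[n].FUNCTION = DWT_FUNC_WRITE;    /* programmed last: arms the match */
    return 0;
}

/* Disarm watchpoint `n`. */
void watch_disable(dwt_comp_t *comps, unsigned n)
{
    if (n < WATCH_COUNT) {
        comps[n].FUNCTION = DWT_FUNC_OFF;
    }
}
```

On the target this would be called as `watch_enable(WATCH_COMPS, WATCH_DEMCR, 0, &variable)`. Using 0x7 instead of 0x6 in FUNCTION would also catch reads.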
The handler is "just" a regular interrupt handler based on this very good hard fault handler. Whenever the watchpoint is hit all CPU state is available to be saved and analyzed later on. The code should be extended to save the debugging breadcrumbs, make the device safe and reboot.
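The original handler listing is not available here either. The sketch below follows the usual hard-fault-handler pattern the text refers to: a small assembly stub passes the active stack pointer to a C function, which reads the stacked LR and PC and checks the MATCHED bit (bit 24 of each DWT FUNCTION register on ARMv7-M) to find the watchpoint number. `watch_crumbs_t`, `watch_capture` and `debugmon_c` are hypothetical names, and the save-and-reboot step is left as a placeholder.

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t COMP;
    volatile uint32_t MASK;
    volatile uint32_t FUNCTION;
    uint32_t reserved;
} dwt_comp_t;

#define DWT_FUNC_MATCHED (1UL << 24)
#define WATCH_COUNT      3U

/* Breadcrumbs to persist for later analysis (hypothetical layout). */
typedef struct {
    uint32_t pc;          /* stacked program counter */
    uint32_t lr;          /* stacked link register */
    int32_t  watchpoint;  /* first matched comparator, -1 if none */
} watch_crumbs_t;

/* `frame` points at the exception stack frame: r0-r3, r12, lr, pc, xpsr. */
void watch_capture(const uint32_t *frame, dwt_comp_t *comps,
                   watch_crumbs_t *out)
{
    out->lr = frame[5];
    out->pc = frame[6];
    out->watchpoint = -1;
    for (unsigned n = 0; n < WATCH_COUNT; n++) {
        /* Reading FUNCTION clears MATCHED, so each register is read once. */
        if ((comps[n].FUNCTION & DWT_FUNC_MATCHED) != 0U && out->watchpoint < 0) {
            out->watchpoint = (int32_t)n;
        }
    }
}

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Should live in RAM that survives a reset; placement is project-specific. */
watch_crumbs_t watch_crumbs;

__attribute__((used)) void debugmon_c(const uint32_t *frame)
{
    watch_capture(frame, (dwt_comp_t *)0xE0001020UL, &watch_crumbs);
    /* Placeholder: make the device safe, store the crumbs, then reboot. */
    for (;;) { }
}

/* Select MSP or PSP depending on which stack was active, then hand over. */
__attribute__((naked)) void DebugMon_Handler(void)
{
    __asm volatile(
        "tst   lr, #4      \n"
        "ite   eq          \n"
        "mrseq r0, msp     \n"
        "mrsne r0, psp     \n"
        "b     debugmon_c  \n");
}
#endif
```

Keeping the capture logic in plain C, separate from the assembly stub, means it can be unit-tested on a host with a fake stack frame.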
How to use it?
The basic pattern is:
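A sketch of that three-step pattern is shown below. Since the original snippet is not available, `watch_enable` and `watch_disable` are replaced with host-side stand-ins that only record whether a watchpoint is armed. On the target they would program the DWT comparator. The variable `humidity` is a made-up example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-ins (assumption): on the target these would program the DWT. */
static bool watch_armed[3];
static void watch_enable(unsigned n, const volatile void *addr)
{
    (void)addr;
    watch_armed[n] = true;
}
static void watch_disable(unsigned n)
{
    watch_armed[n] = false;
}

static volatile uint32_t humidity;  /* hypothetical protected variable */

/* The only place that is allowed to write `humidity`. */
void humidity_store(uint32_t value)
{
    watch_disable(0);             /* 1. unlock */
    humidity = value;             /* 2. legal write, kept as short as possible */
    watch_enable(0, &humidity);   /* 3. lock again */
}
```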
Simple enough: allow write access to the variable in the intended place(s). Now whenever the variable is accessed outside this section the DebugMon interrupt will be triggered. The section that allows modification of the variable should be as short as possible to minimize the window in which the variable can be written undetected. If a section is too long, an interrupt or another RTOS task can clobber the variable without triggering the handler.
There can be multiple such sections for a single variable, but there are only 3 hardware watchpoints, so you have to use different watchpoint numbers for different variables, and of course only 3 variables can be protected this way.
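The original example listing is missing, so what follows is a reconstruction under assumptions rather than the author's code: a struct fixes the layout, one function updates the protected member legally, and another copies a string without a length check. The names are hypothetical, and the watchpoint functions are host-side stand-ins. The unchecked copy writes through a pointer to the whole struct so that the overrun stays inside the object when run on a host.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-ins (assumption): on the target these would program the DWT. */
static bool watch_armed[3];
static void watch_enable(unsigned n, const volatile void *addr)
{
    (void)addr;
    watch_armed[n] = true;
}
static void watch_disable(unsigned n)
{
    watch_armed[n] = false;
}

/* The struct guarantees that `counter` directly follows `name` in memory. */
static struct {
    char     name[8];
    uint32_t counter;   /* protected by watchpoint 0 */
} data;

/* Legal operation: unlock, do the job, lock again. */
void counter_increment(void)
{
    watch_disable(0);
    data.counter++;
    watch_enable(0, &data.counter);
}

/* Buggy operation: copies without checking the length. A string longer
 * than 7 characters runs past `name` into `counter`. */
void name_set_unchecked(const char *src)
{
    char *dst = (char *)&data;   /* `name` is the first member */
    while ((*dst++ = *src++) != '\0') {
    }
}
```

On the target, the first byte written past `name` would hit the armed watchpoint and trigger the DebugMon handler. On a host nothing stops the overrun and `counter` is silently corrupted, which is exactly the failure mode being hunted.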
I used a struct to ensure the order of variables in memory. The legal operation unlocks the variable, does its job and locks it again. Another function copies a string but does not check the length. If the string is too long, the copy will overrun into the protected variable, which triggers the fault handler.
That was the easy part...
How to analyze the breadcrumbs?
The relevant variables in the fault handler are:
- program counter
- link register
- watchpoint number
The watchpoint number is of course needed to know which variable is affected. The program counter tells which instruction tried to access the protected variable (because the watchpoint fires after the access completes, the saved value usually points just past that instruction). What if it points to functions like strncpy that are used all over the place? The link register tells where that function was called from.
The program counter (and link register) can be mapped back to source code via a disassembly file. Most debuggers (for example Ozone) can also show the disassembly view.
A more realistic example
Let's assume a simple data acquisition application that has the following features:
- ADC task that samples data
- Processing task that does some calculations on the ADC data
- UART task that allows reading the calculated values
- User interface task controlling a display and a keypad
The user reports that sometimes the displayed data is wrong and does not make physical sense (like 500% humidity or a temperature below absolute zero). You make a firmware release that has the particular variable protected with watchpoints. After 2 weeks of uptime it turns out that one of the variables in the processing code gets clobbered by UART code. For example: the UART stores received bytes not in its buffer but somewhere in the processing task's data.
If the UART code is obviously bad, then the fix is easy and you are "lucky". But what if the UART code is correct? For example if it stores received bytes via a pointer, this can only mean that UART variables are clobbered by something else. So you have to release yet another firmware that has UART variables protected this time.
In the end it may turn out that, for example, the UI code was bad: it clobbered UART driver variables when a particular sequence of menus was entered too fast, which in turn led to good UART code destroying the measurements (and at the same time the UART was running correctly but with its buffers located in the wrong place).
In large and complex firmware projects such a chain of events can have several links, so the best approach is to move the watchpoint instrumentation step by step from the obvious symptoms up to the root cause.