Best practices in firmware
Across the years spent writing firmware I have come across techniques that have helped me with reliability, meeting delivery deadlines, and overall "debug quality of life". In this article I want to focus on development practices and life hacks that apply uniquely to firmware, and avoid the ones that also apply to "big" desktop or server software.
This article will be updated as more ideas become crystallized so please add my blog to your RSS reader. 🙂
Updated on 19.08.2024: added section about stack usage (at the end of the article). Thanks Alan! 🙂
General advice
Every embedded system is unique. The constraints on a Cortex-M7 and a PIC16 are very different so this is not just a list of things that should be ticked for every project. I also want to focus on techniques that can be applied to any application, as projects using DSP code, control code, and network stacks could have their own specific techniques.
The universal advice that I would apply blindly to ALL software development is only:
- Use version control and write proper commit messages (archive link). Does it still have to be said in 2024?
- Use your compiler to the fullest and build with all possible warnings enabled. If you absolutely must have dodgy code (or encounter false positives), use #pragmas to selectively disable warnings around the particular piece of code or file, and comment why.
- Use static analyzers.
- Build with a single command. It does not have to be a single make invocation; any single script or command is okay. There is nothing worse (for quality and reproducibility) than having to manually build an application, resources, and bootloader, and bundle the combinations.
- CI/CD. Start with basic build automation (Buildbot, Jenkins), add some test harnesses, and in the end do HIL testing.
Most of the techniques shown in this article could be summarized as "runtime metrics" or "telemetry" (known from "big software"), because things happening before release should be dealt with by compiler warnings, static analysis, and test suites. It is easy to hook up a trace probe when the system is on your desk and see absolutely everything, but that is not possible in the field. Getting the metrics off the device depends, of course, on the exact system. It can be a debug menu on a screen. It can be a log saved to an SD card. It can be read out over CAN. It can be saved in the internal flash of the MCU. To get as much data as possible the process has to be easy enough to be done by the user, or the device needs some kind of telemetry connection.
Indicate resets and startups
The idea is to have an obvious mechanism that will tell that main() has just started. It can be a startup sound, an LED blink, a splash screen, a banner printed to the UART, or a message sent on the CAN bus.
Why is it important? Of course it helps during development to see that the code has restarted when it was not intended. Maybe due to a crash? Or a watchdog reset? It also helps further down the road when the device leaves your desk. For example a particular batch of devices can reset more often due to manufacturing issues. Without an obvious indication it may be very frustrating to debug such issues. It is also important for the customer to see that the device is starting up when it was not expected. Due to a voltage spike from a nearby motor? Or vibration? This can save countless debugging hours.
All modern MCUs have reset status registers that tell if the device was reset at power-on, due to brownout, by the watchdog, CPU lockup, clock failure etc. This information has to be made accessible. Again, it depends on the application. It can be a different startup sound, bus message, or a debug QR code on a display.
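Decoding the reset status can be as simple as a chain of bit tests that picks a startup indication. Here is a minimal sketch; the flag masks below are placeholders, not real register bits, so take the actual positions from your MCU's reference manual:

```c
#include <stdint.h>

/* Placeholder flag masks -- the real bit positions come from the MCU's
   reset status register as documented in its reference manual. */
#define RST_FLAG_POWER_ON (1u << 0)
#define RST_FLAG_BROWNOUT (1u << 1)
#define RST_FLAG_WATCHDOG (1u << 2)

typedef enum {
    RESET_UNKNOWN,
    RESET_POWER_ON,
    RESET_BROWNOUT,
    RESET_WATCHDOG,
} reset_cause_t;

static reset_cause_t classify_reset(uint32_t flags)
{
    /* Check the most specific causes first. */
    if (flags & RST_FLAG_WATCHDOG) return RESET_WATCHDOG;
    if (flags & RST_FLAG_BROWNOUT) return RESET_BROWNOUT;
    if (flags & RST_FLAG_POWER_ON) return RESET_POWER_ON;
    return RESET_UNKNOWN;
}
```

The returned cause can then drive a different startup sound, bus message, or debug QR code, as described above.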
This practice is not applicable to devices where a reset is part of normal operation and there is no state retention between the cycles. For example a smart button or sensor that spends most of its life in very deep sleep mode and only sends one message over the radio when pressed or woken up by a timer. In that case it is normal that the firmware starts from scratch every time.
Uptime
When a device is running continuously, the total time since startup should be counted. It can be in seconds, timer ticks, RTOS ticks, or any other unit. If a device has power saving modes (like tickless idle) then the active and sleeping times can be counted separately.
Why is knowing the uptime useful? Some problems may appear only after tens or hundreds of operating hours, for example due to counter rollovers (see next section), memory fragmentation from repeated allocations/deallocations, or other bugs. Firmware running continuously for hundreds of hours gives confidence about general reliability and error recovery. Or, to the contrary, problems that always pop up after 7 days hint that the bugs will be hard to reproduce. It is also possible that the device resets due to external circumstances (see previous section).
A short uptime does not have to be bad if it is expected. Imagine firmware that controls some machinery running for two hours every day and switched off when done. This metric is also of little use if the device resets regularly as part of normal operation (see the smart button example from previous section).
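An uptime counter can be a few lines of code. A minimal sketch, assuming a 1 kHz tick interrupt (the handler and variable names are made up for this example):

```c
#include <stdint.h>

/* One increment per millisecond; 64 bits never roll over in practice. */
static volatile uint64_t g_uptime_ticks;

/* Call from the periodic 1 kHz tick interrupt (name is illustrative). */
void tick_1khz_handler(void)
{
    g_uptime_ticks++;
}

/* Convert ticks to seconds for display or telemetry. */
uint32_t uptime_seconds(void)
{
    return (uint32_t)(g_uptime_ticks / 1000u);
}
```

On systems with tickless idle, a second counter incremented by the "time slept" on each wakeup gives the sleeping share.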
Be aware of integer overflow
Fixed width integers will overflow when incremented beyond their maximum value. This can lead to all kinds of bugs, especially in timekeeping. Specimen 1: the Boeing 787.
Before doing arithmetic you should check if the result will fit in the integer type being used (archive link). Hint: use macros like UINT32_MAX from stdint.h.
How to test counter overflow? For example by setting the counter variables to some very high value early in main() so they roll over in a couple of seconds or minutes.
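The pre-arithmetic check mentioned above can be wrapped in a tiny helper. A sketch, with a made-up function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a + b would not fit in a uint32_t. The check itself
   is done without overflowing: a > UINT32_MAX - b. */
static bool u32_add_would_overflow(uint32_t a, uint32_t b)
{
    return a > UINT32_MAX - b;
}
```

The same pattern works for the other stdint.h types with their respective _MAX macros.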
Another common use case for incrementing integers is sequence numbers in communication protocols. They are used to reject packets arriving out of order, reject stale information, or protect against replay attacks. These cases should also be easy to test on the bench. Even if the protocol requires counting packets from zero, a debugger test script could be used to set the counter to a very high value, mimicking packet loss.
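One wraparound-tolerant way to decide whether a sequence number is "newer" is serial-number arithmetic (in the spirit of RFC 1982). A sketch with illustrative names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept a packet only if its sequence number is newer than the last one
   seen. The unsigned subtraction makes the comparison survive uint16_t
   wraparound: anything up to half the counter range ahead counts as newer. */
static bool seq_is_newer(uint16_t incoming, uint16_t last_seen)
{
    uint16_t diff = (uint16_t)(incoming - last_seen);

    return diff != 0 && diff < 0x8000u;
}
```

With this check, a counter rolling over from 65535 to 0 is still accepted, which is exactly the case that naive `incoming > last_seen` comparisons get wrong.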
Event counters
The name is entirely made up by me. This method could be called "performance counters" or "poor man's code coverage". The basic idea is to declare an integer variable and increment it whenever "something interesting" happens.
Rough sketch:

```c
/* Reconstructed sketch -- names and packet_crc_ok() are illustrative. */
#include <stdbool.h>
#include <stdint.h>

static uint32_t rx_packet_count;
static uint32_t rx_crc_error_count;

extern bool packet_crc_ok(void);

void radio_rx_handler(void)
{
    rx_packet_count++;            /* "something interesting" happened */

    if (!packet_crc_ok()) {
        rx_crc_error_count++;     /* every error branch gets a counter */
        return;
    }
    /* ... process the packet ... */
}
```
As simple as that. It is somewhat similar to what tracing frameworks like SystemView do, but without immediate upload to a PC. The overhead of storing and incrementing a uint32_t is negligible on modern 32-bit CPUs. Of course the counters could be stored in a single struct and perhaps hidden behind some elegant macros.
I would split this technique into roughly two main groups: counters that are incremented whenever an event happens ("pure" counters) and "runtime metrics". The second group includes things like peak latency, min/max response time, or average frame processing time. Naturally, these operations require more math so their overhead is larger.
What are the "interesting events"? Here are some examples:
- Received/sent packet count. On any interface that handles traffic, e.g. wireless, CAN bus, etc.
- Bad packet count, packets with wrong checksums, packets with wrong data.
- SPI, I2C, and other operation timeouts (see next section)
- Checksum failures
- Error handling branches
- Response latency. For example: save the system tick count in the ISR and then read in the deferred interrupt handler.
- Reset count and uptime
Hint: every error handling branch should have its own counter. Perfect to observe that your test suite also covers error handling. 🙂
Safety timeouts and "interlocking"
Every operation that does not complete immediately should have an explicit timeout and a timeout handler function. It should be impossible to start another operation before the previous one completes (or fails, or times out).
Some examples:
- Waiting for flash write or erase
- Accessing all kinds of external SPI and I2C peripherals (clock stretching!)
- Waiting for responses from other devices over a network or bus etc.
- Waiting for any operation that is expected to deliver an IRQ
Example: a radio where you write some data and the radio delivers an IRQ when the data is sent. What if the IRQ never arrives?
Of course include a counter for every timeout operation. 🙂
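A minimal sketch of the timeout-plus-interlock pattern, using a hypothetical non-blocking flash write; the struct, names, and 100 ms budget are all made up for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed operation state; in real code this would sit in the driver. */
typedef struct {
    bool     busy;           /* interlock flag                        */
    uint32_t started_at_ms;  /* when the operation was kicked off     */
    uint32_t timeout_count;  /* event counter for the timeout branch  */
} flash_op_t;

#define FLASH_WRITE_TIMEOUT_MS 100u   /* illustrative budget */

/* Refuse to start a new operation while the previous one is running. */
static bool flash_op_start(flash_op_t *op, uint32_t now_ms)
{
    if (op->busy) {
        return false;   /* interlock: previous op not done yet */
    }
    op->busy = true;
    op->started_at_ms = now_ms;
    return true;
}

/* Poll from the main loop; fires the timeout handler when the expected
   completion (e.g. an IRQ) never arrives. */
static void flash_op_poll(flash_op_t *op, uint32_t now_ms)
{
    if (op->busy && (now_ms - op->started_at_ms) >= FLASH_WRITE_TIMEOUT_MS) {
        op->timeout_count++;   /* breadcrumb for later inspection      */
        op->busy = false;      /* release the interlock so a retry can run */
    }
}
```

A real driver would also clear `busy` on successful completion (from the IRQ or a status poll); the sketch shows only the timeout path.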
Board self-test
Try to test as many elements of the system as possible during startup and indicate the result in some meaningful way (if the startup time and power budget allow). This saves debugging time when dealing with broken boards, manufacturing changes, and field failures.
It might not be possible to self-test every PCB end-to-end due to the design; however, checking that all peripherals on the I2C and SPI buses are present and respond correctly is trivial. This can be done at startup even if they are not immediately needed. If possible, try to test every external IRQ line as well.
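The presence check can be a short loop over expected addresses. A sketch, assuming a hypothetical probe callback that wraps the real HAL ACK test (the callback and the addresses are made up):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Callback so the real HAL "does this address ACK?" call can be swapped
   for a fake in host-side tests. */
typedef bool (*i2c_probe_fn)(uint8_t addr);

/* Returns a bitmask with one bit set per missing device, so the failure
   can be reported (LED pattern, log entry, bus message). */
static uint32_t i2c_selftest(i2c_probe_fn probe, const uint8_t *addrs, size_t count)
{
    uint32_t missing = 0;

    for (size_t i = 0; i < count; i++) {
        if (!probe(addrs[i])) {
            missing |= 1u << i;   /* record which device did not answer */
        }
    }
    return missing;
}

/* Fake probe for illustration: only address 0x50 answers. */
static bool fake_probe_only_0x50(uint8_t addr)
{
    return addr == 0x50;
}
```

The same shape works for SPI devices by probing a known register (like a JEDEC ID) instead of an address ACK.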
War story
I worked on a device that had an SPI flash. I got info that a whole batch of newly manufactured devices did not work. Thanks to a self-test at startup I was able to immediately determine that the SPI flash had been substituted with a different device (different JEDEC ID). It turned out to be harmless in the end. The substitute part had the same size, same command set, and same timings but it was not approved in the BOM and required some extra contractual sign-off.
Checksum structs and binary data
All structs that leave RAM to be saved to external memories (flash/EEPROM) or go to communication interfaces should be checksummed. There is no good way to be sure without a checksum that the data in a struct or binary blob coming from the "outside world" is valid. An example can be settings that are saved to flash. Even if you validate every parameter you can't be sure that 0x00 or 0xFF was the intended value and not just empty flash.
CRC hardware is nowadays included even in low-end MCUs, so use it to the fullest. Doing CRC in software is also not the end of the world unless the MCU is at the extreme low end.
This is a typical struct template that I tend to use for storing settings in flash:
```c
/* Field names other than magic and crc are illustrative. */
typedef struct {
    uint32_t magic;           /* marks an initialized struct          */
    uint8_t  mac_address[6];
    uint16_t sensor_gain;
    /* ... more settings ... */
    uint32_t crc;             /* CRC32 of everything above, kept last */
} settings_t;
```
Calculating the CRC then looks like crc32(&my_struct, sizeof(my_struct) - sizeof(uint32_t));. The CRC is computed across the whole struct except for the stored CRC itself, which is where the minus one word comes from.
Forward-compatible structs
Copying structs between memories is very efficient. The price is that the binary layout of the stored data and what the (current) code expects must be exactly the same. What if the settings struct from the previous section needs more fields? Changing the size of the struct will make the CRC fail. You can try to just drop backward compatibility but that is acceptable only during early development. It quickly becomes tedious to have to reprogram the same MAC address every time a new settings field is added.
There is no magic way to automatically handle different revisions of a struct, but at least the magic and CRC can be stored at the same place to avoid having to define settings_v1_t, settings_v2_t, etc. The hack is to use an anonymous union with a placeholder:
```c
/* Field names other than magic and crc are illustrative. */
#define SETTINGS_RESERVED_SIZE 1024u   /* match the flash sector size */

typedef struct {
    union {
        struct {
            uint32_t magic;            /* can change per struct revision */
            uint8_t  mac_address[6];
            uint16_t sensor_gain;
            /* ... new fields are appended here ... */
        };
        /* The placeholder keeps the struct size (and therefore the CRC
           offset) fixed across revisions. */
        uint8_t reserved[SETTINGS_RESERVED_SIZE - sizeof(uint32_t)];
    };
    uint32_t crc;                      /* always at the same offset */
} settings_t;
```
How much space to reserve? I think it is best to align it with the underlying flash sector size (or the smallest erasable chunk) and dedicate the whole sector to settings. If the sector size is 1 KB and the settings actually occupy 200 bytes, I would still reserve 1 KB for the struct to keep the code simple. The magic field can be different for different versions of the struct.
Warnings for struct padding
When you define a struct like this

```c
/* Field names are illustrative. */
typedef struct {
    uint8_t  enabled;
    uint32_t timeout_ms;
    uint16_t retries;
} config_t;
```
what the compiler will actually deliver looks like this:

```c
typedef struct {
    uint8_t  enabled;       /* offset 0                    */
    uint8_t  padding1[3];   /* inserted by the compiler    */
    uint32_t timeout_ms;    /* offset 4, naturally aligned */
    uint16_t retries;       /* offset 8                    */
    uint8_t  padding2[2];   /* tail padding                */
} config_t;                 /* sizeof == 12, not 7         */
```
The extra bytes are called the padding and they are needed for memory alignment. Aligning data is generally good for performance. Sometimes it is also a necessity (for example the Cortex-M0 can not do misaligned accesses).
What are the implications? The first one is that the offsets of some fields may not be what they appear to be. As long as the same code reads and saves the data this is not a problem. The second is the obvious waste of space. For structs that don't have many instances it is not an issue. This waste might be relevant for structs that have many instances, for example if you are dealing with a small GUI that has around 128 messages in 20 languages. Saving 8 bytes with better struct packing means total savings of 20 KB.
How to spot struct padding? I surround only the relevant structs with a #pragma to get a build error. It is enough to do it around the typedef. Adding -Wpadded to the compilation flags does not work in practice because it usually generates a sea of messages from libraries, the RTOS, and middleware.

```c
#pragma GCC diagnostic push
#pragma GCC diagnostic error "-Wpadded"

typedef struct {
    uint8_t  enabled;
    uint32_t timeout_ms;   /* error: padding struct to align 'timeout_ms' */
    uint16_t retries;
} config_t;

#pragma GCC diagnostic pop
```
You can also force the compiler to not pad the struct by using the packed attribute, but this will lead to less efficient code, and possibly misaligned accesses if you try to use raw pointers to members of the struct.
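For completeness, a sketch of the packed variant using the GCC/Clang attribute syntax (field names are illustrative):

```c
#include <stdint.h>

/* With the packed attribute the compiler inserts no padding, so sizeof
   equals the sum of the members (1 + 4 + 2 = 7 bytes). The cost: access
   to timeout_ms may be misaligned, which is slower or even faulting on
   some cores. */
typedef struct __attribute__((packed)) {
    uint8_t  enabled;
    uint32_t timeout_ms;
    uint16_t retries;
} packed_config_t;
```

This is most useful for wire-format structs where the layout is dictated by a protocol rather than by the compiler.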
Suffixes for physical units
Embedded systems deal with the "real world". And the real world is measured and described using various units of time, voltage, distance, temperature etc. All variables used for handling anything that has to do with physical measurements should be suffixed in a way that makes the unit obvious. This applies both to fixed-point and floating-point code. I tend to err on the side of long variable names because autocomplete and high-resolution displays make them easy to work with.
Examples:
- battery_voltage_mV
- battery_voltage_100mV
- timeout_ms
- temperature_01C (temperature in units of 0.1 degrees Celsius, i.e. 253 = 25.3 degrees Celsius)
Units don't have to be strictly SI. Suffixes can also contain extra information. Think of all the ways to define "voltage" or "decibels". Examples:
- delay_ticks
- duty_percent (as in PWM duty cycle)
- adc1_reading_Lsb
- amplitude_mVpp (amplitude in millivolts peak-to-peak)
- gain_mV_Lsb (gain, or sensitivity, in millivolts per Lsb)
This naming convention has the small extra benefit that if you are using fixed-width integer types (like uint16_t) and decimal scaling you can (at least roughly) imagine if the number will fit in the variable's range. Temperature calculated in 0.1 degrees Celsius in a uint8_t can only reach 25.5 degrees.
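A sketch of how the suffixes look in a conversion helper; the 3300 mV reference and 12-bit ADC full scale are assumptions for this example:

```c
#include <stdint.h>

/* Convert a raw ADC reading to millivolts. Every name carries its unit,
   so a mismatch (e.g. mixing mV and 100mV) stands out in review. */
static uint16_t battery_voltage_mV_from_adc(uint16_t adc_reading_Lsb)
{
    const uint32_t vref_mV = 3300;        /* assumed reference voltage */
    const uint32_t full_scale_Lsb = 4095; /* assumed 12-bit converter  */

    /* Widen to 32 bits before multiplying so the intermediate fits. */
    return (uint16_t)(((uint32_t)adc_reading_Lsb * vref_mV) / full_scale_Lsb);
}
```

Note the widening cast before the multiplication: 4095 × 3300 would overflow a uint16_t intermediate, which ties back to the integer overflow section above.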
Prepare for crashes - debugging breadcrumbs and logging
Nothing is perfect so prepare for firmware crashes early. Cortex-M has a dedicated hard fault handler for dealing with many problems. The handler can do a lot of things that regular interrupt handler code can, and this includes leaving some clues as to what could have caused the fault. I call this general approach "breadcrumbs" because it is more limited than a full crash log or core dump. I also use this approach for assertions and all "can't happen" occasions.
Breadcrumbs can be stored, for example, in a reserved piece of RAM or RTC backup memory (one that is not zeroed at reset), in external memories, or in flash. Watch out, because after a crash you can not count on everything behaving as expected. The simpler the storage mechanism the better.
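A sketch of such a breadcrumb kept in a not-initialized RAM section. The GCC-style section attribute, the .noinit section name, and the field choices are assumptions; the linker script must actually exclude the section from zeroing:

```c
#include <stdint.h>

/* Magic distinguishes a real breadcrumb from power-on garbage. */
#define BREADCRUMB_MAGIC 0xDEADC0DEu

typedef struct {
    uint32_t magic;
    uint32_t fault_pc;     /* e.g. stacked PC from the hard fault frame */
    uint32_t assert_line;  /* __LINE__ of a failed assertion            */
} breadcrumb_t;

/* Placed in a section the startup code must NOT zero (linker-dependent). */
__attribute__((section(".noinit"))) static breadcrumb_t g_breadcrumb;

/* Call from the hard fault handler or a failed assert. Keep it dumb:
   plain stores only, since the system state is suspect after a crash. */
static void breadcrumb_leave(uint32_t fault_pc, uint32_t assert_line)
{
    g_breadcrumb.fault_pc = fault_pc;
    g_breadcrumb.assert_line = assert_line;
    g_breadcrumb.magic = BREADCRUMB_MAGIC;  /* set last: record is valid */
}
```

After the next reset, startup code checks the magic, reports the breadcrumb (log, bus message, display), and clears it.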
Use integer types from stdint.h
When dealing with integer arithmetic avoid types like int, short int, long int, and always use types from stdint.h that specify their exact width to avoid being surprised by overflows. These include types like uint32_t, int16_t, uint8_t, etc. How big is a long long int anyway?
My rite of passage was discovering (the hard way) that code using int ran fine on a PC (which was 32-bit back then) but failed on an AVR. int in avr-gcc is 16-bit (even though the CPU is 8-bit). From that moment I started to exclusively use types from stdint.h whenever the width is important. This also helps with portability. I do not exactly mean changing MCUs in the middle of a project, but here are some practical examples:
- Moving code between 8-bit, 16-bit, 32-bit MCUs (okay, pretty rare...)
- Moving code between a PC (amd64) and an MCU (32-bit? 8-bit?). This especially applies to code that is tested in a harness running on a PC.
- Moving code between 32-bit ARM and 64-bit ARM
Keep in mind that the smallest type does not have to be the fastest. Usually the "native" size is the fastest, for example 32-bit ints on Cortex-M. A narrower type may save data memory but will increase code size because the code will have to do extra operations (masking etc.). stdint.h provides "fast" types like int_fast16_t which have a guaranteed minimum width but can be wider if that makes them faster. Of course don't rely on any particular width or wraparound behaviour.
stdint.h also provides handy macros with the maximum values of the types, for example INT16_MAX. They can be used to check if an arithmetic result will fit in a particular type.
Heartbeat signal
Every device should have an easy way to discover if it is "still alive". It can be a blinking LED for a human to look at, or it can be some kind of periodic keepalive message sent on a bus. This is especially useful for networked devices as you can see if there are any patterns when the devices stop communicating. It is easier to do remotely than having to analyze event counters or other breadcrumbs. If the power supply is the problem then all you would get from the other debugging techniques would be only a power-on reset event.
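A heartbeat can be driven from the main loop with a small helper. A sketch with illustrative names; the period is an assumption:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when it is time to toggle the heartbeat LED or send the
   keepalive message. Unsigned subtraction keeps the comparison correct
   across the tick counter's wraparound. */
static bool heartbeat_due(uint32_t now_ms, uint32_t *last_ms, uint32_t period_ms)
{
    if ((uint32_t)(now_ms - *last_ms) >= period_ms) {
        *last_ms = now_ms;   /* schedule the next beat */
        return true;
    }
    return false;
}
```

A useful side effect: if the main loop stalls, the heartbeat stops too, so the heartbeat doubles as a crude liveness check of the loop itself (unlike a timer-interrupt-driven blink, which keeps running through many hangs).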
Be aware of the stack (added on 19.08.2024)
Automatic variables and function calls in C are most often implemented using a stack. Obviously the data has to be physically stored somewhere in RAM at runtime. On a PC with a full operating system, stack usage limits are rarely a problem (apart from security issues) as the OS will easily give you gigabytes of memory. However, embedded systems are more limited, so stack usage needs some care.
Here is my complete article about dealing with the stack on Cortex-M.