M0AGX / LB9MG

Amateur radio and embedded systems

Story from the trenches - uninitialized RAM is not random (enough)

This is another story from the world of "it can't happen". A device was undergoing final testing that spanned many days in different simulated conditions. The testing included power cycling whenever the conditions were changed. Once in a very long while, on the order of maybe once per week, the device failed to communicate with the test equipment.

This was a pretty severe issue because you can't release to market a device that will sometimes obviously fail to start, even if it only happens once every couple of thousand power cycles. To make matters worse, the devices were pretty inaccessible in the test setups: they were placed in climate chambers and only exchanged information with the outside world through one communication interface.

Why would it fail? Bad design? Bad components on the PCB? Cracked solder joints? Environmental factors? Bad wiring? Firmware? Test scripts and setup? Random mishandling & abuse during the very short lifecycle? All the information I received amounted to around five data points saying that "these devices failed to start". There was no correlation with temperature, and a device started immediately after another power cycle at the same temperature, so it could not be a cracked solder joint. None of the devices failed twice in a row. There was no correlation with the date of manufacture. The suspect devices also worked perfectly fine on the bench and in "real life". A firmware dump was 100% correct. X-rays did not reveal anything anomalous either. It was extremely frustrating, as it took around two weeks to produce these very few data points, and yet something had to be done about it.

The application used the uninitialized RAM lifehack for debugging breadcrumbs, storing crash handler info, etc., but because power cycling was part of the test it was normal for all of this information to be cleared every time. The firmware also included a bootloader used for firmware upgrades. What if the bootloader was starting instead of the application?

Bare metal bootloaders, despite their folk reputation of being somehow "scary", are fundamentally simple. To update the application they have to receive N bytes of new firmware over some communication interface, place them in flash, verify the validity/checksum and jump to the beginning of the application code. This simplicity also makes them easy to test. A basic test set is: upload the largest possible valid image, upload the smallest possible valid image, upload an invalid image.
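
For the last step, here is a minimal sketch of what "jump to the beginning of application code" can look like, assuming a Cortex-M part (one that has the VTOR register) built with GCC; the application start address is made up for illustration:

    #include <stdint.h>

    /* Hypothetical address of the application, right after the bootloader. */
    #define APP_START_ADDRESS  0x08004000u

    typedef void (*app_entry_t)(void);

    static void start_application(void)
    {
        /* The first two words of a Cortex-M vector table are the initial
           stack pointer and the address of the reset handler. */
        uint32_t app_stack = *(volatile uint32_t *)(APP_START_ADDRESS);
        uint32_t app_reset = *(volatile uint32_t *)(APP_START_ADDRESS + 4u);

        /* Interrupts should be disabled and peripherals deinitialized before
           getting here. Point the vector table at the application
           (SCB->VTOR lives at 0xE000ED08 on parts that have it). */
        *(volatile uint32_t *)0xE000ED08u = APP_START_ADDRESS;

        __asm volatile ("msr msp, %0" : : "r" (app_stack));
        ((app_entry_t)app_reset)();   /* never returns */
    }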

What are the tricky points then? The overall logic and concept. Imagine a device that is not updated often. The update requires the user to plug in a USB stick with a .bin file and select "Update" in some menu or press a combination of buttons. Compare that with a smart bulb that is updated over Zigbee. In the first case the user can always start over if the update was interrupted in the middle. In the second case there is nothing to press, so the process cannot rely on the user doing anything. It has to be fully automatic.

Some concepts

In random order and with my own "non-scientific" naming. 🙂

"Fixed bootloader with autostart"

The bootloader checks firmware integrity and/or a button press at startup, and then launches the application (or waits for new firmware). The bootloader itself never changes. This is a good concept for a device that is accessible and whose user is proficient enough to prepare a USB stick or SD card, or to connect a USB cable and run some app. The application can use the entire flash memory (minus the size of the bootloader), so no flash is wasted, and if there is no firmware to update the startup time is practically instantaneous (mind the checksum verification).
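
The power-on decision can be as short as the sketch below; all the helper functions are hypothetical placeholders, just to show the flow:

    #include <stdbool.h>

    /* Hypothetical helpers provided elsewhere in the bootloader. */
    bool read_boot_button(void);
    bool application_image_is_valid(void);
    void run_firmware_update(void);
    void start_application(void);

    void bootloader_main(void)
    {
        if (read_boot_button() || !application_image_is_valid()) {
            /* Stay in the bootloader and receive a new image
               (USB stick, SD card, serial port...). Returns once a
               valid image has been programmed. */
            run_firmware_update();
        }
        /* Hand over control to the application. */
        start_application();
    }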

"Fixed bootloader without autostart"

Same as above, but the bootloader always waits for a command before starting the application. This is a good concept for a device that is inaccessible but is connected to some master equipment. For example a sensor at the end of a long cable. A very long cable. Hundreds of meters under the surface of the sea. There is nobody to press a button on the PCB in case the application crashes, enters a brick loop, etc., but if the device is power cycled the bootloader starts first and waits for commands. The application can use the entire flash memory (minus the size of the bootloader), and there may be some delay during startup as the bootloader needs to receive a command before starting the application.

"Dual boot with hardware support"

Some chips, for example the Kinetis K64, have a feature that allows splitting the flash into two halves and swapping their addresses. This makes it possible to build the firmware update code into the application itself. The application receives the new image, saves it into the "upper half" of the flash, reconfigures the flash controller to map the "upper half" as the "lower half", and resets the MCU. The upper/lower setting (in the case of the K64) is held in a totally separate configuration sector.
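
A sketch of the application-side flow is below; all the functions are hypothetical placeholders for the vendor-specific flash driver (on the K64 the actual swap sequence is documented in the reference manual):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical wrappers around the vendor flash driver. */
    void flash_erase_inactive_bank(void);
    void flash_write_inactive_bank(const uint8_t *data, uint32_t length);
    bool inactive_bank_image_is_valid(void);
    void flash_request_bank_swap(void);
    void system_reset(void);

    void perform_update(const uint8_t *image, uint32_t length)
    {
        flash_erase_inactive_bank();
        flash_write_inactive_bank(image, length);   /* the "upper half" */

        if (!inactive_bank_image_is_valid()) {
            return;                 /* keep running the current image */
        }
        flash_request_bank_swap();  /* the inactive bank becomes the boot bank */
        system_reset();             /* after reset the new image runs from address 0 */
    }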

This scheme is good, for example, when the communication protocol needs lots of code (like a TCP stack and WiFi baseband firmware), so a separate bootloader would consume lots of flash on its own. The downside is that if the new application "goes rogue" there is no way to recover the device in the field, unless there is a magic button for booting the old image (but that feature has to work right in the new application).

"Dual boot without hardware support"

If the chip does not support flash remapping, the flash can be partitioned into a small bootloader, the first (active) application image and the second (inactive) application image. Bare metal code is usually (for efficiency reasons) built to run at a particular address (for example in flash). A function call to foo() is therefore compiled into something like JMP 0x123456, where the magic number is a fixed address. This makes it quite hard to build an application that could run from either the first or the second location. It is possible, but comes with some complications and performance penalties.

To avoid the problems of position-independent code, the application can be compiled to run from a fixed address and the bootloader can verify the checksums of both the active and the inactive images at startup. If the inactive image is newer, the bootloader erases the active firmware and copies the image from the second location to the first location in flash. The update process is reliable because there is always at least one usable application image.
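
A sketch of that startup logic, with made-up slot addresses and hypothetical helpers:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flash layout. */
    #define SLOT_ACTIVE    0x08008000u   /* the application is linked for this address */
    #define SLOT_INACTIVE  0x08044000u   /* new images are downloaded here */
    #define SLOT_SIZE      0x0003C000u

    /* Hypothetical helpers provided elsewhere in the bootloader. */
    bool image_is_valid(uint32_t slot_address);
    uint32_t image_version(uint32_t slot_address);
    void flash_erase_region(uint32_t address, uint32_t size);
    void flash_copy(uint32_t dst, uint32_t src, uint32_t size);
    void start_application(void);
    void wait_for_new_image(void);

    void bootloader_main(void)
    {
        bool active_ok   = image_is_valid(SLOT_ACTIVE);
        bool inactive_ok = image_is_valid(SLOT_INACTIVE);

        if (inactive_ok &&
            (!active_ok || image_version(SLOT_INACTIVE) > image_version(SLOT_ACTIVE))) {
            /* The inactive slot holds a newer (or the only) valid image:
               move it to the address the application was linked for. */
            flash_erase_region(SLOT_ACTIVE, SLOT_SIZE);
            flash_copy(SLOT_ACTIVE, SLOT_INACTIVE, SLOT_SIZE);
        }

        if (image_is_valid(SLOT_ACTIVE)) {
            start_application();   /* does not return */
        }
        /* No usable image at all: stay here and wait for one. */
        wait_for_new_image();
    }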

Bootloader entry methods

The bootloader needs to know when to start the update process and when to start the application. I call these "entry methods", "entry conditions", or "entry stimuli". Entry conditions are always checked by the bootloader at power-on or after a reset.

Application verification

Obviously, the bootloader has to wait for a new application image if the one that is present fails checksum verification. There is no point in starting damaged firmware.
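
This is the kind of image_is_valid() helper the earlier sketches assumed. The image layout here is made up: the length and the expected CRC32 sit in a small header in front of the image, and crc32() stands for whatever CRC routine the project already uses:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_IMAGE_SIZE  0x0003C000u   /* hypothetical slot size */

    struct image_header {
        uint32_t length;   /* number of image bytes covered by the CRC */
        uint32_t crc32;    /* expected CRC of those bytes */
    };

    uint32_t crc32(const uint8_t *data, uint32_t length);   /* project CRC routine */

    bool image_is_valid(uint32_t slot_address)
    {
        const struct image_header *hdr = (const struct image_header *)slot_address;
        const uint8_t *payload = (const uint8_t *)(slot_address + sizeof(*hdr));

        if (hdr->length == 0u || hdr->length > MAX_IMAGE_SIZE) {
            return false;   /* erased flash or a nonsense header */
        }
        return crc32(payload, hdr->length) == hdr->crc32;
    }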

Button press at power-on

The title basically says everything. This is my favourite method (if possible). Very easy to test. Especially useful if the device has a power switch or is powered from USB and is user-accessible.

External storage + power-on

If the startup time is not critical (for example if 2-3 seconds are acceptable) and the device has a connector for removable storage (like an SD card or USB), the bootloader can search the file system for files with magic names and check whether they contain a newer version of the firmware.
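
A sketch of such a check, assuming the FatFs library; the file name and the programming helper are made up:

    #include "ff.h"   /* FatFs */

    void maybe_program_image(FIL *file);   /* hypothetical: header check + flashing */

    void check_removable_storage(void)
    {
        FATFS fs;
        FIL file;

        if (f_mount(&fs, "", 1) != FR_OK) {
            return;   /* no card / no file system: just boot normally */
        }
        if (f_open(&file, "firmware.bin", FA_READ) == FR_OK) {
            /* Read the image header, compare its version against the one in
               flash and program it only if it is newer. */
            maybe_program_image(&file);
            f_close(&file);
        }
        f_mount(NULL, "", 0);   /* unregister the work area */
    }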

Request from the application via memory / mailbox

This very common method uses some kind of memory that survives a system reset to tell the bootloader to enter the update mode (a sketch of the plain-RAM variant follows the list below). First of all - why is a reset needed? The application can put the system into a state that the bootloader was never designed to work with. Therefore a reset (instead of just making a function call into the bootloader) is required to make the system predictable. Another reason is that (if the hardware supports it) the bootloader can lock the flash to prevent the application from damaging the bootloader.

Examples of memories in a microcontroller that survive a reset:

  • Regular MCU RAM. RAM is not set to any particular value and keeps all its contents unchanged during an MCU reset (unless touched by the startup code). Contents are unpredictable at power-on.
  • "RTC domain" that some MCUs provide to keep date and time. This domain often has its own power pin to use with a coin cell battery and some tiny RAM that can be used for anything.
  • Flash. Obviously. I have not used this scheme yet as storing (and erasing) the bootloader entry flags would probably "waste" a whole sector. A couple of bytes in RAM seem more economical.
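
A minimal sketch of the plain-RAM variant, assuming a GCC toolchain with a .noinit section in the linker script (so the startup code never zeroes it) and CMSIS for the reset call; the names and the magic value are made up:

    #include <stdint.h>
    #include "cmsis_device.h"   /* placeholder for the vendor CMSIS header (NVIC_SystemReset) */

    /* See the end of this post for why this marker should not be too short. */
    #define ENTER_BOOTLOADER_MAGIC  0xB007F1A5u

    static volatile uint32_t boot_request __attribute__((section(".noinit")));

    /* Called by the application when a firmware update is requested. */
    void request_bootloader(void)
    {
        boot_request = ENTER_BOOTLOADER_MAGIC;
        NVIC_SystemReset();   /* reset the MCU; RAM keeps its contents */
    }

    /* Called by the bootloader right after reset. */
    int bootloader_entry_requested(void)
    {
        if (boot_request == ENTER_BOOTLOADER_MAGIC) {
            boot_request = 0u;   /* consume the flag */
            return 1;
        }
        return 0;
    }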

Back to the original story

The device I mentioned at the beginning of this post used uninitialized RAM not only for debugging breadcrumbs but also for bootloader entry. The breadcrumbs were protected by a strong CRC, but because the application had to start as fast as possible (to meet some power-on calibration timing constraints) the bootloader entry was controlled by a single byte. As in uint8_t. The problem? RAM contents are unpredictable at power-on. You probably know where the story goes...

Static RAM (used by most MCUs) is made of flip-flops. At power-on the state of a flip-flop is undefined, but because it is a physical digital circuit it must settle to either a logical zero or a one. Depending on manufacturing imperfections a given flip-flop may always power up as one, always as zero, or anywhere in between, depending on the voltage ramp-up, temperature and other uncontrollable factors. The RAM (as a whole) of every MCU is unique when it comes to these small imperfections, so it can be used for device fingerprinting and physically unclonable functions.

The bug "obviously" was that given enough rolls of the dice (RAM power cycles) any magic value would eventually appear in uninitialized RAM by chance. It seems that for that particular MCU getting the right magic 8-bit value to enter the update mode required a couple thousands of power cycles. The bug is obvious in hindsight but not that easy to encounter in development. The device was almost never power cycled during development and the uninitialized part of RAM was cleared by the application if no debugging breadcrumbs were left.

The obvious fix was to make the magic value significantly longer (on the order of 128 bits) to reduce the probability of it appearing by chance. Basically a two-line fix for a couple of days of debugging time.
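
Roughly what that fix amounts to; the names and values below are made up:

    /* Before (the bug): a single byte of uninitialized RAM as the entry flag.
     *     static uint8_t boot_request __attribute__((section(".noinit")));
     * A biased-but-random power-on pattern will match any 8-bit magic value
     * sooner or later. */

    /* After (the fix): a 128-bit marker, practically impossible to hit by chance. */
    #include <stdint.h>
    #include <string.h>

    static uint32_t boot_request[4] __attribute__((section(".noinit")));

    static const uint32_t ENTER_BOOTLOADER_MAGIC[4] = {
        0x8F2C61D3u, 0x45B7A90Eu, 0xD1E84C72u, 0x3A96F5B8u
    };

    int bootloader_entry_requested(void)
    {
        return memcmp(boot_request, ENTER_BOOTLOADER_MAGIC,
                      sizeof(ENTER_BOOTLOADER_MAGIC)) == 0;
    }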

Does less than 1 LOC per day count as productive for McKinsey? 🙂