Bug of the day: missing entry point
This is a tale of a strange bug I ran into when doing my first ever SAM E70 project. A new MCU. A new type of product. A custom PCB. What could possibly go wrong? 🙂
Spoiler: Not strange. Not a bug.
I started by using Atmel's START tool (pun intended). It is a web-based project generator.
I selected my MCU, picked the clock configuration, added some extra components (like the USB stack) and got a zip file in the end.
This tool hits the sweet spot as it does not require you to download many gigabytes of SDKs and IDEs,
yet gives you a nice minimal project to start with (the headers + startup code + clock configuration + basic Makefile).
START is still available but is deprecated (as of 2023).
The project built nicely with gcc. I connected a J-Link to my custom board. I opened the elf in Ozone,
flashed the MCU, let it run and... immediately ended up in the hard fault handler. Not great, not terrible.
The firmware crashed instantly but on the other hand I could connect with a debugger and flash the chip
without any errors.
By default, Ozone halts on main() after flashing, so the startup code is executed but not any of the "application" code.
This initially led me to suspect the startup or the clock setup code. In the hard fault handler the CPU behaved normally.
I was able to single step the instructions so the hardware could not be that bad. Unfortunately, all the diagnostic
variables in the hard fault handler had only junk addresses so I could not see an obvious cause for the crash.
I started checking everything I could think of:
- Supply voltages - nominal
- Crystal signal and frequency - nominal
- Solder joints - good
- gcc build flags - all good for Cortex-M7
- Part number on the package vs. the one in START - exact match
- Ozone MCU selection - exact match
- Oscillator config in START - matched the PCB
- Reset handler code - seemed to make sense
- Vector table contents - good (initial stack pointer at top of RAM and good reset handler address)
- Binary inspection in hex editor - nominal
After exhausting all options I added code that changes the state of GPIOs at various stages to "debug with a scope" or "debug with an LED" in order to find the last working step. To my surprise the code worked after flashing and a power cycle. The code must have been correct all along. It had to be the debugger doing something wrong.
To confirm it I loaded a raw binary file into Ozone. It does not have any metadata so I had to input the load address manually. I also manually typed the initial stack pointer and program counter into the registers using the debugger. Then, I let the CPU run. And it worked!
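On a Cortex-M part, the two values to type in by hand normally come from the start of the image: the first 32-bit word of the vector table is the initial stack pointer and the second is the reset handler address. A minimal Python sketch that pulls them out of a raw binary could look like this. The function names and the demo values are made up for illustration, and it assumes the vector table sits at offset 0 of the .bin file.

```python
import struct
from typing import Tuple


def read_initial_sp_and_pc(image: bytes) -> Tuple[int, int]:
    """Return (initial stack pointer, reset vector) from a Cortex-M image.

    Assumption: the image starts with the vector table, so word 0 is the
    initial main stack pointer and word 1 is the reset vector.
    """
    if len(image) < 8:
        raise ValueError("image too short to contain a vector table")
    initial_sp, reset_vector = struct.unpack_from("<II", image, 0)
    return initial_sp, reset_vector


def describe(image: bytes) -> str:
    """Format the two values the way one might read them off in a debugger."""
    sp, reset = read_initial_sp_and_pc(image)
    # Bit 0 of the reset vector is the Thumb bit and must be set on Cortex-M;
    # the handler's actual address has that bit cleared.
    thumb = "set" if reset & 1 else "NOT set"
    return f"SP=0x{sp:08X} PC=0x{reset & ~1:08X} (Thumb bit {thumb})"


if __name__ == "__main__":
    # Made-up example values, not taken from the original project.
    demo_image = struct.pack("<II", 0x20460000, 0x00400141) + bytes(8)
    print(describe(demo_image))
```

To use it on a real file, read the first 8 bytes of the .bin and pass them to `describe`.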
This discovery made me certain that it must be something with either Ozone or the elf file.
On one hand, the elf was loaded perfectly into flash, so the linker script could
not be that broken (it was provided by Atmel START anyway).
On the other hand, Ozone could not reach main(). The linker script looked sane at first
glance. It took me quite a while to spot the problem: a missing ENTRY(Reset_Handler).
It seems that Atmel START must have been using a wrong linker file template.
After adding the missing entry point to the linker script Ozone was able to start the CPU properly and reach main. Yay! 🙂
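A missing ENTRY command is easy to overlook by eye, so a small check in the build could catch it. The following is only a sketch: `find_entry_symbol` is a hypothetical helper, and the two linker script fragments are heavily shortened, made-up examples rather than the file generated by Atmel START.

```python
import re
from typing import Optional


def find_entry_symbol(linker_script: str) -> Optional[str]:
    """Return the symbol named in ENTRY(...), or None if there is no ENTRY."""
    # Drop /* ... */ comments so a commented-out ENTRY does not count.
    without_comments = re.sub(r"/\*.*?\*/", "", linker_script, flags=re.DOTALL)
    match = re.search(r"\bENTRY\s*\(\s*([A-Za-z_.$][\w.$]*)\s*\)", without_comments)
    return match.group(1) if match else None


# Hypothetical, heavily shortened linker scripts for illustration only.
BROKEN = """
MEMORY { rom (rx) : ORIGIN = 0x00400000, LENGTH = 0x00200000 }
SECTIONS { .text : { KEEP(*(.vectors)) *(.text*) } > rom }
"""
FIXED = "ENTRY(Reset_Handler)\n" + BROKEN

if __name__ == "__main__":
    print(find_entry_symbol(BROKEN))  # no ENTRY command present
    print(find_entry_symbol(FIXED))   # ENTRY command added at the top
```

The check only looks at the script text. An entry point can also be passed to the linker on the command line with `-e`, which this sketch does not see.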
Why did the broken elf (partially) work?
An elf does not need an entry point to start with. An example can be a shared library
(though they have standardized "constructors" which are a kind of entry point for the loader
and operating system).
The layout of the sections was specified correctly in the linker script so all the code
ended up in flash at the correct address. The code used RAM at the correct address.
The vector table had references to Reset_Handler so it was not optimized out during linking.
Basically the .hex or .bin file was fully usable when it ended up in the flash of the MCU.
That day I learned that the only problem was Ozone not being able to recognize where the firmware should start from.
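The entry address a tool sees is stored in the e_entry field of the elf header, so it can be checked without a debugger. Below is a minimal Python sketch that reads this field. `read_elf_entry` and `fake_elf32_header` are names invented for this example, and the demo header is synthetic. When ENTRY is missing, GNU ld falls back to other choices for this field, so it may still hold some address; the question is whether that address is the reset handler. The value in the broken elf from this story is not recorded here.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def read_elf_entry(header: bytes) -> int:
    """Return e_entry from the first bytes of an ELF file (32- or 64-bit)."""
    if len(header) < 32 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                 # EI_CLASS: 1 = ELF32, 2 = ELF64
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4),
    # which puts it at offset 24 in both ELF32 and ELF64.
    fmt = endian + ("Q" if is_64bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


def fake_elf32_header(entry: int) -> bytes:
    """Build a synthetic little-endian ELF32 header, for demonstration only."""
    e_ident = ELF_MAGIC + bytes([1, 1, 1]) + bytes(9)   # class, data, version, padding
    fields = struct.pack("<HHII", 2, 40, 1, entry)      # ET_EXEC, EM_ARM, version, e_entry
    return (e_ident + fields).ljust(52, b"\x00")        # an ELF32 header is 52 bytes


if __name__ == "__main__":
    # Made-up address, not taken from the original project.
    print(hex(read_elf_entry(fake_elf32_header(0x00400140))))
```

For a real file, pass the first 64 bytes of the elf to `read_elf_entry`. `readelf -h` reports the same field as "Entry point address".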