M0AGX / LB9MG

Amateur radio and embedded systems

Fixing Cortex-M startup code for link-time optimization

Link-time optimization (LTO) is a powerful way to reduce output size. Although it is (as of 2018) still regarded as somewhat experimental, LTO is worth trying if the binary size is very important and the application can be reliably tested afterwards, because link-time optimized code is hard to debug. A bootloader is an ideal example. LTO is very easy to enable, but there are some small quirks that have to be taken care of. I will use GCC 7.2.1 from GNU Arm Embedded as an example.

What is LTO?

All C (and C++) applications begin their lives as a bunch of source files (e.g. .c) that are compiled separately (by gcc) into object code (.o files). The object files are then linked into a single binary (e.g. .elf). Optimization is usually applied in the compilation step, so the compiler can, for example, inline a function that is used only once or is very small. During the linking phase, essentially the only optimization performed by the linker is throwing out unused symbols (i.e. functions or data).

If a function from one module is used only once in another module, the compiler cannot inline it, so the output still includes a complete function call. A function call is usually "cheap" on modern processors, but the state of the caller must still be preserved on the stack, so it costs something in terms of extra instructions. This is where LTO comes in. With LTO enabled, the toolchain is able to optimize the application across module boundaries. For example, a function used only once in another module can be inlined, which reduces code size and improves speed.
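To make the cross-module case concrete, here is a minimal sketch. The function names and the split into checksum.c and main.c are hypothetical and only serve as an illustration; both parts are shown in one listing so that it compiles on its own.

```c
#include <stdint.h>

/* --- Would live in checksum.c (hypothetical module) --- */
uint32_t checksum_add(uint32_t sum, uint8_t byte)
{
    return sum + byte;
}

/* --- Would live in main.c, which only sees the prototype of checksum_add() --- */
uint32_t checksum_block(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        /* Compiled separately, this stays a real function call.
           With -flto it becomes a candidate for inlining. */
        sum = checksum_add(sum, data[i]);
    }
    return sum;
}
```

When the two functions really sit in separate .c files, the compiler working on main.c knows nothing about the body of checksum_add(), so only a link-time optimizer gets the chance to merge them.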

Enabling LTO in GCC

To enable LTO, simply add the -flto option to your CFLAGS and LDFLAGS, then recompile everything. As simple as that... almost.

When building my Cortex-M bootloader without LTO, the output size was 15032 bytes; with LTO enabled it was a mere 8648 bytes. A code size reduction of more than 40% is too good to be true. I tried running the binary on the microcontroller, but it did not even bother to crash. Something was evidently wrong, but how do you debug raw machine code? It is not even assembly... Let's fire up the hex editor!

Correct binary without LTO

[Figure: memory dump of the correct binary without LTO]

What can be seen here? Well... some binary data, but it definitely does not look random. If you know the Cortex-M architecture, you can recognize the first word (4 bytes) as the initial stack pointer and the second word as the address of the reset function (reset vector). The beginning of the binary is usually occupied by the interrupt vector table. The recurring values are just addresses of the same default handler.
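The same reading of the dump can be done in code. The following is a minimal sketch that runs on a PC and decodes the first two words of a raw image; the struct and function names are made up for this illustration. It assumes a little-endian image, and relies on the fact that Cortex-M vector entries have bit 0 set to indicate Thumb code.

```c
#include <stddef.h>
#include <stdint.h>

/* First two entries of a Cortex-M vector table. */
struct vector_head {
    uint32_t initial_sp;    /* word 0: initial stack pointer */
    uint32_t reset_vector;  /* word 1: address of the reset handler */
};

/* Read one little-endian 32-bit word from a raw byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the start of the image. Returns 0 on success, -1 if the image is too short. */
int decode_vector_head(const uint8_t *image, size_t len, struct vector_head *out)
{
    if (len < 8) {
        return -1;
    }
    out->initial_sp   = read_le32(image);
    out->reset_vector = read_le32(image + 4);
    return 0;
}

/* A valid handler address in the vector table has bit 0 (the Thumb bit) set. */
int has_thumb_bit(uint32_t vector)
{
    return (vector & 1u) != 0;
}
```

If the decoded stack pointer does not point into RAM, or the reset vector is even or points outside the flash, the image most likely does not start with a vector table at all.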

Wrong binary with LTO

[Figure: memory dump of the wrong binary with LTO]

Something is very different from the correct binary: the data at the very beginning looks "random". This would mean that the linker removed the whole interrupt vector table. With the reset vector and stack pointer missing, the code is pure junk. The linker may have been over-aggressive because, from the point of view of the whole application, the vector table looks totally useless. No application function calls any of the vectors (only the hardware does), and the vector table is not referenced anywhere in the application code.

How to fix it?

I looked at the startup code of my MCU (EFM32GG). The vector table is nicely laid out in a C array:

/*----------------------------------------------------------------------------
  Exception / Interrupt Vector table
 *----------------------------------------------------------------------------*/
const pFunc __Vectors[] __attribute__ ((section(".vectors"))) = {
  /* Cortex-M Exception Handlers */
  (pFunc)&__StackTop,                       /*      Initial Stack Pointer     */
  Reset_Handler,                            /*      Reset Handler             */
  NMI_Handler,                              /*      NMI Handler               */
  HardFault_Handler,                        /*      Hard Fault Handler        */
  MemManage_Handler,                        /*      MPU Fault Handler         */
  BusFault_Handler,                         /*      Bus Fault Handler         */
  UsageFault_Handler,                       /*      Usage Fault Handler       */
  Default_Handler,                          /*      Reserved                  */
  Default_Handler,                          /*      Reserved                  */
  Default_Handler,                          /*      Reserved                  */
  Default_Handler,                          /*      Reserved                  */
  SVC_Handler,                              /*      SVCall Handler            */
  DebugMon_Handler,                         /*      Debug Monitor Handler     */
  Default_Handler,                          /*      Reserved                  */
  PendSV_Handler,                           /*      PendSV Handler            */
  SysTick_Handler,                          /*      SysTick Handler           */

  /* External interrupts */

  DMA_IRQHandler,                       /*  0 - DMA       */
  GPIO_EVEN_IRQHandler,                       /*  1 - GPIO_EVEN       */
  TIMER0_IRQHandler,                       /*  2 - TIMER0       */
  USART0_RX_IRQHandler,                       /*  3 - USART0_RX       */
  USART0_TX_IRQHandler,                       /*  4 - USART0_TX       */
  USB_IRQHandler,                       /*  5 - USB       */
  ACMP0_IRQHandler,                       /*  6 - ACMP0       */
  ADC0_IRQHandler,                       /*  7 - ADC0       */
  DAC0_IRQHandler,                       /*  8 - DAC0       */
  I2C0_IRQHandler,                       /*  9 - I2C0       */
  I2C1_IRQHandler,                       /*  10 - I2C1       */
  GPIO_ODD_IRQHandler,                       /*  11 - GPIO_ODD       */
  TIMER1_IRQHandler,                       /*  12 - TIMER1       */
  TIMER2_IRQHandler,                       /*  13 - TIMER2       */
  TIMER3_IRQHandler,                       /*  14 - TIMER3       */
  USART1_RX_IRQHandler,                       /*  15 - USART1_RX       */
  USART1_TX_IRQHandler,                       /*  16 - USART1_TX       */
  LESENSE_IRQHandler,                       /*  17 - LESENSE       */
  USART2_RX_IRQHandler,                       /*  18 - USART2_RX       */
  USART2_TX_IRQHandler,                       /*  19 - USART2_TX       */
  UART0_RX_IRQHandler,                       /*  20 - UART0_RX       */
  UART0_TX_IRQHandler,                       /*  21 - UART0_TX       */
  UART1_RX_IRQHandler,                       /*  22 - UART1_RX       */
  UART1_TX_IRQHandler,                       /*  23 - UART1_TX       */
  LEUART0_IRQHandler,                       /*  24 - LEUART0       */
  LEUART1_IRQHandler,                       /*  25 - LEUART1       */
  LETIMER0_IRQHandler,                       /*  26 - LETIMER0       */
  PCNT0_IRQHandler,                       /*  27 - PCNT0       */
  PCNT1_IRQHandler,                       /*  28 - PCNT1       */
  PCNT2_IRQHandler,                       /*  29 - PCNT2       */
  RTC_IRQHandler,                       /*  30 - RTC       */
  BURTC_IRQHandler,                       /*  31 - BURTC       */
  CMU_IRQHandler,                       /*  32 - CMU       */
  VCMP_IRQHandler,                       /*  33 - VCMP       */
  LCD_IRQHandler,                       /*  34 - LCD       */
  MSC_IRQHandler,                       /*  35 - MSC       */
  AES_IRQHandler,                       /*  36 - AES       */
  EBI_IRQHandler,                       /*  37 - EBI       */
  EMU_IRQHandler,                       /*  38 - EMU       */

};

The fix was simply to change the attribute from __attribute__ ((section(".vectors"))) to __attribute__ ((section(".vectors"), used)). The used attribute tells the whole toolchain that this symbol is really in use and must not be optimized out.
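For reference, a shortened sketch of the table with the changed attribute could look like this. The stack array and the empty handlers are hypothetical stand-ins (in the real startup file __StackTop comes from the linker script and the handlers from the vendor code), added only so that the fragment compiles on its own with GCC for an ELF target.

```c
#include <stdint.h>

typedef void (*pFunc)(void);

/* Hypothetical stand-ins so that the sketch is self-contained. */
static uint32_t demo_stack[64];
void Reset_Handler(void)   { }
void Default_Handler(void) { }

/* "used" marks the table as referenced even though no C code touches it,
   so it is not discarded as an unreferenced symbol. */
const pFunc __Vectors[] __attribute__ ((section(".vectors"), used)) = {
  (pFunc)&demo_stack[64],                   /*      Initial Stack Pointer     */
  Reset_Handler,                            /*      Reset Handler             */
  Default_Handler,                          /*      (remaining entries omitted) */
};
```

The only difference from the original declaration is the extra used keyword inside the attribute list; the contents of the table stay exactly the same.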

Correct binary with LTO

[Figure: memory dump of the correct binary with LTO]

After a recompile, the beginning of the binary contains the vector table again, and the binary runs fine on the microcontroller. The output size is 13100 bytes (vs. 15032 originally), so 1932 bytes (about 13%) were saved. This is something worth fighting for in a bootloader.

What are the downsides? The output code is "heavily digested" by the toolchain, so it is much harder to debug. Combining the -g (include debugging symbols) and -flto options in GCC is officially experimental, so the results can be unpredictable. For example, Ozone cannot read the debugging symbols after LTO.

In my case, I developed the bootloader without LTO. When everything seemed done, I built it with LTO and tested it. Testing a bootloader is manageable because its functionality is very limited. My tests basically consisted of loading several application binaries of different sizes (a complete block, a block minus one byte, a block plus one byte, the maximum application size) and dumping the flash memory with a debugger after every step. The bootloader worked well. :)
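The list of test sizes can also be generated rather than written down by hand. Below is a minimal sketch of such a helper for a PC-side test script; the function name and its parameters are hypothetical, and the block size and maximum application size have to be taken from the actual bootloader.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill out[] with the boundary sizes worth testing: one byte less than a block,
   exactly one block, one byte more, and the maximum application size.
   Sizes that are zero, exceed max_app_size or repeat an earlier entry are skipped.
   Returns the number of sizes written (at most out_cap). */
size_t boundary_test_sizes(uint32_t block_size, uint32_t max_app_size,
                           uint32_t *out, size_t out_cap)
{
    const uint32_t candidates[4] = {
        block_size - 1u, block_size, block_size + 1u, max_app_size
    };
    size_t n = 0;

    for (size_t i = 0; i < 4; i++) {
        uint32_t c = candidates[i];
        int duplicate = 0;

        if (c == 0u || c > max_app_size) {
            continue;  /* not a usable application size */
        }
        for (size_t j = 0; j < n; j++) {
            if (out[j] == c) {
                duplicate = 1;
            }
        }
        if (!duplicate && n < out_cap) {
            out[n++] = c;
        }
    }
    return n;
}
```

For example, with a 4096-byte block and a 65536-byte limit (both values picked only for illustration) the helper yields 4095, 4096, 4097 and 65536.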