M0AGX / LB9MG

Amateur radio and embedded systems

Reducing firmware size by removing libc

The C standard library (libc) is a component that gets little attention. It is just there. On embedded systems, however, it brings some challenges and overhead in terms of code size. As firmware size is often critical, it sometimes makes sense to use a trimmed version of the standard library or to remove it entirely. Here I focus on reducing code size, which is especially useful for a small application like a bootloader.

I use GCC 7.2.1 and newlib from GNU Arm Embedded. The target MCU is a Cortex-M3. Everything is compiled with -Os -flto. I compared building my bootloader with nosys.specs (full flavor of newlib), nano.specs (trimmed flavor of newlib) and no libc at all (with two versions of the needed standard functions).

The bottom line is:

Variant                                                 Size (bytes)
--specs=nosys.specs                                     13716
--specs=nano.specs                                      13296
-nostdlib -ffreestanding, libc functions from Apple     13100
-nostdlib -ffreestanding, bare-bones libc functions     12864

852 bytes can easily be saved without sacrificing any functionality. How to do it and what are the tradeoffs?

Most library functions are written to be universal, portable and to have the best average performance. Newlib is already quite optimized for embedded systems (compared to glibc...), but there is still much that can be removed. Things like standard input/output just do not exist in an MCU, and printf is a prime source of bloat. These features are definitely not needed by a bootloader.

GCC takes --specs as one of its arguments at link time. The specs file defines which features of the standard library will be available. Newlib provides nosys.specs and nano.specs (there is also rdimon.specs).

nosys.specs provides the full feature set and requires the largest amount of code.

rdimon.specs is a flavor that is used for semihosting. Semihosting is a technique where a function is called on a microcontroller but then (at least partially) executed on a PC running a debugger. For example, the microcontroller can do an fopen on a file present on the PC, which would not normally be possible. Of course, firmware built for semihosting will not be able to run on its own (it will likely crash when calling functions that require debugger support).

nano.specs provides fewer features (for example a less capable printf).

How to remove libc?

Simply compile with -nostdlib and link with -nostdlib -ffreestanding. Very soon you will run into linker trouble. First of all, functions like memcpy and memset are used by my bootloader, so they must be provided somehow. There are basically two ways: either write them yourself or use someone else's implementation. A good source can be... Apple! Apple uses a lot of BSD code, so their core source code is freely available. Examples: memcpy, memset. These functions can simply be dropped into your project.

Minimal functions

memcpy is a very simple function that copies N bytes from one place to another. This can be done in many ways. A naive implementation can copy byte-by-byte:

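#include <stddef.h> /* size_t */

__attribute__((weak)) void* memcpy(void *dst0, const void *src0, size_t length){
    char *dst = (char*)dst0;
    const char *src = (const char*)src0; /* keep the const qualifier of the source */
    while (length--){
        *dst++ = *src++; /* copy one byte at a time */
    }
    return dst0;
}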

This is of course wasteful on a 32-bit CPU that can transfer 4 bytes at a time. If the source and destination pointers are word-aligned, data can be copied up to 4 times faster as whole words rather than bytes (apart from the last chunk). This leads to a larger function with more code, so memcpy can be optimized for either code size or execution speed.
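The word-wise variant described above could look roughly like the sketch below. It is only an illustration, not the code from my bootloader: the name memcpy_words is made up so that it does not clash with the real memcpy, and the sketch assumes that accessing the buffers through uint32_t pointers is acceptable for your compiler settings (a production implementation has to deal with strict aliasing).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, chosen so that it does not clash with the real memcpy. */
void *memcpy_words(void *dst0, const void *src0, size_t length)
{
    unsigned char *dst = (unsigned char *)dst0;
    const unsigned char *src = (const unsigned char *)src0;

    /* Fast path: only taken when both pointers are 4-byte aligned. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 3u) == 0) {
        uint32_t *dw = (uint32_t *)dst;
        const uint32_t *sw = (const uint32_t *)src;
        while (length >= 4) {
            *dw++ = *sw++;      /* one whole word per iteration */
            length -= 4;
        }
        dst = (unsigned char *)dw;
        src = (const unsigned char *)sw;
    }

    /* Remaining tail, or the whole buffer when unaligned: byte by byte. */
    while (length--) {
        *dst++ = *src++;
    }
    return dst0;
}
```

The extra alignment check and the second loop are exactly the code size cost mentioned above.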

Similarly a bare-bones memset can look like this:

__attribute__((weak)) void *memset(void *dst0, int c, size_t length){
    char *dst = (char*)dst0;
    while (length--){
        *dst = (char)c; /* the standard prototype passes the fill value as int */
        dst++;
    }
    return dst0;
}

In my bootloader project I do not have to care much about speed, because it is limited by flash erase and write times anyway so optimizing for code size makes more sense.

The weak attribute

To simplify the build process it is nice to keep all the files in the build and not make special exceptions for when to compile the extra libc functions and when not. GCC supports the weak function attribute. A function can be defined this way: __attribute__((weak)) void *memset(void *dst0, int c, size_t length)

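There can be only a single definition of a function with a particular name in a C application. The weak attribute relaxes this rule: if another definition (without the weak attribute) is linked in, the linker uses that one and throws away the weak one. This is a nice trick: you can keep your own libc functions in the application, and when you build with a full version of the standard library its non-weak functions can "override" them. One caveat: a linker extracts members from a static library only to resolve undefined symbols, so it is worth checking which definition actually ended up in the final image.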
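A minimal sketch of the mechanism, using a made-up function name (demo_fill) so that it can be tried on a PC without touching the real memset:

```c
#include <stddef.h>

/* Weak default: the linker keeps it only when no strong (non-weak)
 * definition of demo_fill is present among the linked object files. */
__attribute__((weak)) void *demo_fill(void *dst0, int c, size_t length)
{
    unsigned char *dst = (unsigned char *)dst0;
    while (length--) {
        *dst++ = (unsigned char)c;
    }
    return dst0;
}

/*
 * A strong definition in another object file, for example
 *
 *     void *demo_fill(void *dst0, int c, size_t length) { ... }
 *
 * would be used instead, and the linker would not report
 * a "multiple definition" error.
 */
```

When this file is built on its own, the weak definition is simply used like any other function.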

Startup code

After solving issues with the functions that are in use by the application the linker will very likely complain about missing functions like _exit and _start. These functions are the glue between the ARM startup code, standard library and your main() function.

The usual Cortex-M startup sequence begins with loading the initial stack pointer (first word in flash) and jumping to the address specified in the reset vector (second word in flash). A reset handler (in case of the EFM32) is just a regular C function:

/*----------------------------------------------------------------------------
  Reset Handler called on controller reset
 *----------------------------------------------------------------------------*/
void Reset_Handler(void) {
  uint32_t *pSrc, *pDest;
  uint32_t *pTable __attribute__((unused));

#ifndef __NO_SYSTEM_INIT
  SystemInit();
#endif

/*  Firstly it copies data from read only memory to RAM. There are two schemes
 *  to copy. One can copy more than one sections. Another can only copy
 *  one section.  The former scheme needs more instructions and read-only
 *  data to implement than the latter.
 *  Macro __STARTUP_COPY_MULTIPLE is used to choose between two schemes.  */

#ifdef __STARTUP_COPY_MULTIPLE
/*  Multiple sections scheme.
 *
 *  Between symbol address __copy_table_start__ and __copy_table_end__,
 *  there are array of triplets, each of which specify:
 *    offset 0: LMA of start of a section to copy from
 *    offset 4: VMA of start of a section to copy to
 *    offset 8: size of the section to copy. Must be a multiple of 4
 *
 *  All addresses must be aligned to 4 bytes boundary.
 */
  pTable = &__copy_table_start__;

  for (; pTable < &__copy_table_end__; pTable = pTable + 3)
  {
    pSrc  = (uint32_t*)*(pTable + 0);
    pDest = (uint32_t*)*(pTable + 1);
    for (; pDest < (uint32_t*)(*(pTable + 1) + *(pTable + 2)) ; )
    {
      *pDest++ = *pSrc++;
    }
  }
#else
/*  Single section scheme.
 *
 *  The ranges of copy from/to are specified by following symbols
 *    __etext: LMA of start of the section to copy from. Usually end of text
 *    __data_start__: VMA of start of the section to copy to
 *    __data_end__: VMA of end of the section to copy to
 *
 *  All addresses must be aligned to 4 bytes boundary.
 */
  pSrc  = &__etext;
  pDest = &__data_start__;

  for ( ; pDest < &__data_end__ ; )
  {
    *pDest++ = *pSrc++;
  }
#endif /*__STARTUP_COPY_MULTIPLE */

/*  This part of work usually is done in C library startup code. Otherwise,
 *  define this macro to enable it in this startup.
 *
 *  There are two schemes too. One can clear multiple BSS sections. Another
 *  can only clear one section. The former is more size expensive than the
 *  latter.
 *
 *  Define macro __STARTUP_CLEAR_BSS_MULTIPLE to choose the former.
 *  Otherwise define macro __STARTUP_CLEAR_BSS to choose the latter.
 */
#ifdef __STARTUP_CLEAR_BSS_MULTIPLE
/*  Multiple sections scheme.
 *
 *  Between symbol address __zero_table_start__ and __zero_table_end__,
 *  there are array of tuples specifying:
 *    offset 0: Start of a BSS section
 *    offset 4: Size of this BSS section. Must be a multiple of 4
 */
  pTable = &__zero_table_start__;

  for (; pTable < &__zero_table_end__; pTable = pTable + 2)
  {
    pDest = (uint32_t*)*(pTable + 0);
    for (; pDest < (uint32_t*)(*(pTable + 0) + *(pTable + 1)) ; )
    {
      *pDest++ = 0;
    }
  }
#elif defined (__STARTUP_CLEAR_BSS)
/*  Single BSS section scheme.
 *
 *  The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 *  Both addresses must be aligned to 4 bytes boundary.
 */
  pDest = &__bss_start__;

  for ( ; pDest < &__bss_end__ ; )
  {
    *pDest++ = 0ul;
  }
#endif /* __STARTUP_CLEAR_BSS_MULTIPLE || __STARTUP_CLEAR_BSS */

#ifndef __START
#define __START _start
#endif
  __START();
}
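The very first step of the sequence, before Reset_Handler runs, is done by the core itself: it fetches two words from the vector table. The sketch below imitates that on a PC to make the layout concrete. All names in it (vector_table_head, fake_stack, Reset_Handler_demo, simulate_reset) are hypothetical. On a real target the table is placed at the start of flash by the linker script and the fetch happens in hardware.

```c
#include <stdint.h>

typedef void (*handler_t)(void);

/* First two entries of a Cortex-M vector table
 * (the remaining entries are exception and interrupt vectors). */
struct vector_table_head {
    void     *initial_sp;  /* word 0: initial value of the stack pointer */
    handler_t reset;       /* word 1: address of the reset handler */
};

static uint32_t fake_stack[64];  /* hypothetical stand-in for the stack RAM */
static int reset_handler_ran;    /* set by the demo handler below */

/* Hypothetical stand-in for the real Reset_Handler. */
void Reset_Handler_demo(void)
{
    reset_handler_ran = 1;
}

/* On a target the linker script would place this object at the start of flash. */
static const struct vector_table_head vector_table = {
    fake_stack + 64,       /* the stack grows down, so SP starts at the top */
    Reset_Handler_demo,
};

/* Imitates on a PC what the core does in hardware after reset. */
void *simulate_reset(const struct vector_table_head *vt)
{
    void *sp = vt->initial_sp;  /* the core loads this word into SP */
    vt->reset();                /* the core loads this word into PC */
    return sp;
}
```

Calling simulate_reset(&vector_table) returns the top of fake_stack and runs the demo handler, which is all that happens before the C code of the reset handler takes over.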

EFM32 does not have any special startup requirements (like initialization of clocks, PLLs, memories etc.). Everything starts in a state that can execute C code right away (of course later on the hardware has to be configured to use the right clocks, peripherals, memories etc.). The important steps in EFM32 startup code are:

  • copying values of the data section (all global variables with defined values)
  • zeroing of the BSS section (all global variables without values)
  • calling _start

It is important to know whether it is the startup code that does data and BSS initialization or the library code. I had to add -D__STARTUP_CLEAR_BSS=1 after removing libc from the build to make the startup code do the initialization.

Missing startup functions

Newlib using nano.specs needs the _start and _exit functions to be defined. The _start function needed by my bootloader simply calls main:

int main(void); /* prototype assumed here; adjust it to match the application */

__attribute__((weak)) void _start(void){
    main();
}

and _exit does nothing:

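/* Needed by the linker with nano.specs. It must never actually be called:
   __builtin_unreachable() only tells GCC that this point is not reached. */
__attribute__((weak, noreturn)) void _exit(int a __attribute__((unused))){
    __builtin_unreachable();
}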

With the above functions included, the application should link correctly. All these steps allowed me to save 852 bytes of flash. It may seem hardly worth it, but it also means that the (USB) bootloader needs just 13 KB, not 14 KB. This in turn leads to 1 KB more (due to flash erase/write organization) being available for the application. :)
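The rounding behind that last number can be checked with a few lines of C. This is only a sketch: the function name flash_footprint is made up, and the page size of 1024 bytes is an assumption taken from the 1 KB granularity mentioned above, so check the datasheet of your part.

```c
#include <stdint.h>

/* Rounds a firmware size up to a whole number of flash pages.
 * page_bytes is the erase granularity (assumed to be 1024 in the text above). */
uint32_t flash_footprint(uint32_t size_bytes, uint32_t page_bytes)
{
    return ((size_bytes + page_bytes - 1u) / page_bytes) * page_bytes;
}
```

With 1024-byte pages, flash_footprint(13716, 1024) gives 14336 (14 pages) and flash_footprint(12864, 1024) gives 13312 (13 pages), so the 852 bytes saved free up one whole page.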