M0AGX / LB9MG

Amateur radio and embedded systems

Practical comparison of ARM compilers

My inspiration for this article was this thread in r/embedded. There are still many myths and misconceptions when it comes to the simple questions "Which compiler should I use?" or "Which compiler is the best?". I will try to share my practical experience with GCC, Clang and IAR on Cortex-M (and Cortex-M only). This article does not cover IDEs or debuggers. All three toolchains generate standard ELF files that can be used by gdb (with its countless GUIs) and Ozone, and all three toolchains can work with any IDE/editor.

Comments to this article are on Hacker News.

TL;DR GCC, Clang and IAR are all good for Cortex-M.

I can honestly say that (in the 2020s) GCC, clang and IAR are all rock-solid on Cortex-M. Throughout my career I have been mainly dealing with "control-style code" with lots of branches, events, communication and interfacing. I found the code generated by different compilers performing almost identically. This may not hold for signal processing code. If your project relies on "heavy" algorithms you should be benchmarking it across compilers, different optimization levels, different compiler flags, versions of the same compiler to see which one works best.

The issues I ran into where the toolchain was a contributing factor could be roughly categorized into:

  1. "Obviously" bad code that changed behaviour depending on the particular compiler version
  2. "Obviously" bad code that changed behaviour depending on optimization level
  3. "Obviously" bad code that changed behaviour depending on compiler flags
  4. "Obviously" bad code that depended on undefined behaviour
  5. "Non-obviously" bad code that depended on implementation-defined behaviour of the compiler

What is a toolchain anyway?

The "traditional" or "theoretical academic" embedded toolchain consists of a compiler + assembler + linker. In practice there are more pieces to the puzzle:

  • Editor or IDE (example: IAR Embedded Workbench)
  • Build system (example: GNU Make or iarbuild)
  • Debugger (example: gdb, Eclipse plugins for gdb, SEGGER Ozone etc.)
  • C standard library (example: newlib)
  • Linker scripts
  • MCU startup code
  • C runtime initialization code
  • Libraries (example: CMSIS)
  • "Small tools" like size, objcopy, objdump, srecord, ar
  • "Large tools" like static analyzers (eg. C-STAT, clang-tidy), linters, coverage analyzers, profilers
  • "Middleware", "big" libraries, network stacks, file systems etc.
  • RTOS
  • All the things needed for C++...

It may be hard to distinguish where the toolchain ends, where the libraries start etc. You often get a whole package from the chip vendor like Simplicity Studio from Silabs that has the entire GCC toolchain, Eclipse-based IDE and also all kinds of headers, libraries and network stacks for their chips.

If you pick the stock Arm GNU Toolchain (formerly called GNU Arm Embedded) you will only get the GCC compiler, GNU binutils (eg. the ld linker), gdb, C runtime init code (crt0) and the newlib C standard library. You will have to get the device headers, startup code and linker script from the silicon vendor yourself. This is different than for example the avr-gcc toolchain that has pretty much everything built-in (headers and linker scripts) so you can start development with only a main.c (this can be due to AVR chips being less complex than Cortex-M chips and being only made by a single company).

Why choose a particular toolchain?

Familiarity

Familiarity is the single most powerful force when it comes to toolchain selection. It is much easier and faster to use the tools that you already know, right? The same applies to the team you are working with. If you have had ten successful projects with GCC you will very likely stick with GCC. If you have had ten with IAR you will also stick with it.

This can be both good and bad. The good side is obvious. The bad one often is "we have always used XYZ 1.3 [that was released a decade ago] so we have to use it now", even when better alternatives are available.

Customer wants XYZ

This is an obvious requirement. If you are supposed to only provide a working device (+ manufacturing instructions) to a company that outsourced the development work to you then the toolchain selection is not that important (as long as you deliver a working product).

However, if you are a silicon vendor and you have to provide libraries for your customers to use then you have to target multiple toolchains. The same applies to firmware middleware and libraries. Of course there is no sense in trying to support all possible toolchains and their versions but at least in my silicon development career I had to support GCC and IAR simultaneously because the customers asked for it.

Maintaining an old project

This is one of the very few (in my opinion) reasons to stick with an old toolchain. Some electronic products have long lifetimes (for example in industrial automation) so it is not uncommon to find yourself working on a product that has been delivered a decade ago, qualified, certified, passed from one team to another across the years and has some small features added every second year. In that case you do not want to re-learn all the toolchain & application quirks.

It also makes sense to keep a minimal VM with the complete toolchain in case the OS becomes unavailable, unsupported or some license activation is not possible anymore. Naturally, an open-source toolchain like GCC or Clang is less at risk of disappearing in the future or deprecation and lack of support compared to a toolchain made only by a single company.

Certification / safety

Safety-critical product development comes with lots of special constraints. Depending on the overall safety architecture, requirements and relevant standards there may be a need for extra compiler "paperwork" (or qualification). Usually a commercial vendor like IAR can provide certification paperwork for its compiler. However, there are also companies like SolidSands that provide test suites for open-source compilers so it may still be possible to use GCC in a device needing functional safety.

Practical comparison of GCC, Clang and IAR

GCC and Clang from the user's point of view are nearly 100% identical. It is almost trivial to take an embedded C project that was started with GCC and build it with Clang (unless you have used a horrible amount of GCC extensions). Object code can also be freely linked. For example GCC's libc can be linked with your Clang object files. You will immediately feel at home as the vast majority of command line options are identical. There are some differences when it comes to warnings. I think that Clang supports more of them. IAR on the other hand comes with its own independent heritage so reading the documentation is necessary to find out the command line options. Of course GCC & Clang are open source so you can do almost anything you want while IAR comes in various editions, needs license servers, activation keys etc.

One GCC-specific extension that I really like is nested functions. They are basically functions within functions (and the inner function has access to variables of the outer function). I used them from time to time to limit the scope of small helper functions (like wrapping mathematical formulas or doing small computations). The problem is (apart from being only supported by GCC) that they are a surprisingly complex topic so I had to give them up for portability reasons.

Build systems

All three compilers work nicely with GNU Make. IAR also comes with its own iarbuild tool that uses an XML file to define the project files and options. This XML file is generated by the Embedded Workbench IDE (but can also be altered by hand).

IAR has slightly different command line options and arguments so the makefiles need tweaking. Overall, I would say that it is pretty easy to get IAR to work with plain makefiles.

In the past IAR was only available for Windows but as of today (from 2021?) IAR also comes as a command-line only package (bxarm) that runs on Linux. This means that an IAR project can be integrated into a CI pipeline with Jenkins or Buildbot just as easily as GCC or Clang.

Lifehack: use IAR on the build server to check that your code builds with IAR but avoid the hassle of setting up IAR & licenses on every developer workstation, keep developing primarily with GCC or Clang. If the IAR build fails the developer will get instant feedback from the build server.

Code portability

I can say from my experience that developing mostly low-level code using all three compilers is not that hard if you stick to standard C constructs, avoid inline assembly, don't use too many attributes and too many compiler extensions. All three compilers support at least C11. There are some differences when it comes to the standard library but I found them pretty harmless like missing itoa() (which is not part of the C standard but still handy).

Whenever some #ifdefs are necessary I wrap the toolchain-specifics into my own functions. For example myapp_itoa() that either calls the toolchain's itoa() or custom code. This allows me to keep the "messy" code in a single place, avoid duplication and make clean calls to myapp_something() everywhere else. You can also wrap on the linker level but that makes your build toolchain-specific.
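A minimal sketch of such a wrapper is shown here. The MYAPP_HAVE_ITOA define is hypothetical (something the build system would set for toolchains that ship itoa()), and the fallback is just one possible implementation, not code from any particular project:

```c
#include <stdlib.h>

/* MYAPP_HAVE_ITOA is a hypothetical define set by the build system for
 * toolchains whose C library provides the non-standard itoa(). */
char *myapp_itoa(int value, char *buf, int base)
{
#if defined(MYAPP_HAVE_ITOA)
    return itoa(value, buf, base);
#else
    /* Portable fallback: emit digits in reverse order, then flip them. */
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    char *out = buf;
    unsigned int magnitude = (unsigned int)value;
    int negative = 0;

    if (base < 2 || base > 36) {
        *buf = '\0';
        return buf;
    }
    if (base == 10 && value < 0) {
        negative = 1;
        magnitude = 0u - magnitude;   /* well defined even for INT_MIN */
    }
    do {
        *out++ = digits[magnitude % (unsigned int)base];
        magnitude /= (unsigned int)base;
    } while (magnitude != 0u);
    if (negative) {
        *out++ = '-';
    }
    *out = '\0';

    /* Reverse the string in place. */
    for (char *lo = buf, *hi = out - 1; lo < hi; ++lo, --hi) {
        char tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
    return buf;
#endif
}
```

The caller provides the buffer (34 bytes is enough for a 32-bit value in base 2 including the terminator), and the rest of the application only ever sees myapp_itoa().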

Code size

I found out that IAR (configured for smallest code size optimization) consistently delivered binaries that were approximately 2-3% smaller compared to GCC with -Os. Clang on the other hand made binaries that were approximately 4% larger compared to GCC. I did not do any exact performance measurements or benchmarking. All binaries functioned identically, passed all my test cases, there were no timing-critical features only some communication code that was expecting responses within tens of milliseconds.

Is it significant? Maybe. All projects I have worked on (that could afford a Cortex-M) were not that code size sensitive. Price constrained products tend to gravitate towards cheap PIC16 and PIC12 anyway. My customers often asked for a particular memory headroom. For example "no more than 60% flash utilization for version 1.0" so the MCU was always over-specced. This is pretty reasonable if you plan to keep the same product alive for a number of years and extend its functionality along the way. However, if you are working on an extremely cost sensitive product, say, an ASIC motor controller with a mask ROM CPU then every single byte will count. For example many ICs sold as "motor controllers" actually have a built-in CPU that runs the control algorithm.

Small note: in my builds GCC & Clang binaries were always a multiple of 4 bytes long while the IAR "stride" was 2 bytes. Poorly written code that CRCs the firmware blob word-wise will fail on IAR binaries.
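One way to stay independent of the image length is to compute the checksum byte by byte. The sketch below assumes the common reflected CRC-32 (polynomial 0xEDB88320); the actual algorithm used by a given bootloader may differ, and fw_crc32 is only an illustrative name:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 that walks the image one byte at a time, so it works for
 * any length (including images that are not a multiple of 4 bytes). */
uint32_t fw_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            /* mask is all ones when the lowest bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

A table-driven or hardware CRC is faster, but the point is the same: the loop counts bytes, not words.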

Warnings

All three compilers give decent warnings. I have a feeling that Clang gives the best messages and output that points exactly to the issue, even within a single line (this is highly subjective). I found IAR to be picky about volatile accesses. Code like uint32_t x = REG_A + REG_B; would (correctly) give warnings that the access order is undefined. I have never seen this warning in GCC.
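The usual way to make the access order explicit is to read each register into a local variable first. In this sketch REG_A and REG_B are plain volatile variables standing in for memory-mapped registers, so that it also builds on a PC:

```c
#include <stdint.h>

/* Stand-ins for memory-mapped hardware registers (assumption: on a real
 * target these would come from the vendor's device header). */
volatile uint32_t REG_A = 2u;
volatile uint32_t REG_B = 3u;

uint32_t read_sum_in_order(void)
{
    uint32_t a = REG_A;   /* first volatile access */
    uint32_t b = REG_B;   /* second volatile access */
    return a + b;         /* no volatile access left in this expression */
}
```

Each statement ends with a sequence point, so every compiler has to read REG_A before REG_B.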

Sometimes some warnings have to be enabled and disabled within the code. For example I enable padding warnings (-Wpadded) in some structs that should have the exact layout preserved (eg. when transmitting them over a wire or saving to flash). The syntax and contents of #pragmas that change warnings are different between GCC/Clang and IAR so some #ifdefs are necessary.
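A sketch of how this can be wrapped, assuming hypothetical MYAPP_PADDING_CHECK_* macros. Only the GCC/Clang branch is filled in; for IAR the macros expand to nothing here (its own "#pragma diag_..." family would go in that branch) and a C11 static assertion acts as a portable backstop:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* GCC and Clang share the "GCC diagnostic" pragma syntax. */
#define MYAPP_PADDING_CHECK_BEGIN \
    _Pragma("GCC diagnostic push") \
    _Pragma("GCC diagnostic warning \"-Wpadded\"")
#define MYAPP_PADDING_CHECK_END \
    _Pragma("GCC diagnostic pop")
#else
/* Other toolchains: no warning control in this sketch. */
#define MYAPP_PADDING_CHECK_BEGIN
#define MYAPP_PADDING_CHECK_END
#endif

MYAPP_PADDING_CHECK_BEGIN
struct wire_header {
    uint32_t sequence;
    uint16_t length;
    uint8_t  type;
    uint8_t  flags;
};
MYAPP_PADDING_CHECK_END

/* Works on every C11 compiler, regardless of warning settings. */
_Static_assert(sizeof(struct wire_header) == 8,
               "wire_header must not contain padding");
```

The struct definition itself stays free of #ifdefs; only the two macros know about the toolchain.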

Overall, it is great to build your code with multiple toolchains and as many warnings enabled as practical because some issues may be spotted only by the first tool, but not the second (and vice versa for other issues).

Attributes

In embedded projects sometimes you have to dive "below C". Attributes are a way of telling the compiler what to do with your code or data that can't be expressed in standard C. Some attributes that I tend to use:

  • alignment - Example: DMA needs word (4 bytes) alignment but you have a uint8_t buffer. Without the attribute the compiler can place the variable at any address and DMA may not work. Another example: the Cortex-M vector table needs to be aligned on 256 bytes (or more). The attribute is necessary when using it as an array in RAM.
  • section - Example: putting functions in RAM instead of flash (for speed or to let them run while flash is being modified).
  • forced inlining & flattening - Example: putting a function and all functions it calls in RAM. It is expensive but may improve performance (both speed & power consumption) or may be necessary (eg. during flash operations).
  • deprecation - When a variable or function is scheduled to be retired/removed it can show you where it is still used (or give a warning to a new developer that tries to use it).
  • unused (applied to function arguments) - Example: you have a family of functions and pass them as pointers. It is nice to typedef the function pointers but if the arguments of all of them are not matched you will get a warning about a bad cast. It may be easier to declare all functions with redundant arguments and mark the redundant ones as unused. Another example is a volatile variable that you plan to inspect with the debugger. It is "not used" as far as the compiler sees it but still serves a purpose.
  • format (checking that the arguments match the format string in printf and scanf-like functions) - Example: own, wrapped printf-like function used for logging.

The syntax differs across compilers. GCC and Clang are obviously 100% compatible, IAR is different but accepts some of GCC's attributes. My recommendation: wrap the attributes in macros and have a look at the ones already defined in CMSIS (see: cmsis_compiler.h).
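A minimal sketch of such wrapper macros for three of the attributes listed above. The MYAPP_* names are made up for this illustration, and the non-GNU branch is a conservative fallback (C11 alignment, no checking), not a description of what IAR supports:

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__GNUC__) || defined(__clang__)
#define MYAPP_ALIGNED(n)        __attribute__((aligned(n)))
#define MYAPP_UNUSED            __attribute__((unused))
#define MYAPP_PRINTF_LIKE(f, a) __attribute__((format(printf, f, a)))
#else
/* Fallback: standard C11 alignment, the other two become no-ops. */
#define MYAPP_ALIGNED(n)        _Alignas(n)
#define MYAPP_UNUSED
#define MYAPP_PRINTF_LIKE(f, a)
#endif

/* Byte buffer that a DMA engine requires to start on a word boundary. */
MYAPP_ALIGNED(4) uint8_t dma_buffer[16];

/* Family of handlers sharing one signature; this one ignores `context`. */
typedef int (*event_handler_t)(int event, void *context);

int on_timer(int event, void *context MYAPP_UNUSED)
{
    return event + 1;
}

/* printf-like logger: the compiler checks arguments against the format. */
MYAPP_PRINTF_LIKE(3, 4)
int myapp_log(char *out, size_t size, const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int written = vsnprintf(out, size, fmt, args);
    va_end(args);
    return written;
}
```

With the format attribute in place, a call like myapp_log(buf, sizeof buf, "adc=%d", "text") produces a warning on GCC and Clang instead of a runtime surprise.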

Built-ins

Built-ins (or intrinsics) are the single biggest area that I find different across compilers. I would divide the intrinsics into two broad categories: the ones that emit a particular instruction that is not available in C (eg. WFE, WFI, SEV, CPSID) and regular functions that may map efficiently to hardware instructions (eg. __builtin_bswap32 is a single REV on Cortex-M). They can either be "truly built-in" or they can be provided as assembly code wrapped in macros or functions.

Clang & GCC are pretty identical when it comes to built-ins but IAR is very different. Why do you need intrinsics in the first place and how often? I tend to use them maybe 2-3 times a year. I mostly use the ones dealing with operations on bits. Example: you read a status register from the hardware and want to find the first (or last) bit that is set to execute the right handler. You can loop and shift but in GCC you can also call __builtin_ffs (yes, it is a real name). It may (or may not) be more efficient than a regular loop but if the code is moved to a different CPU it may be built using more efficient instructions. For example __builtin_popcount maps to a single instruction on x86. A compiler may be smart enough to recognize what you are doing in a hand-written loop and still replace it with popcount. I prefer explicit (and well commented) code when it comes to handling seldom-used features so I would rather use an intrinsic clearly stating the intent rather than a loop.
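A sketch of how the status-register example could be wrapped so that the rest of the code does not care which compiler is in use. The function name is made up, and the non-GNU branch is a plain loop rather than an IAR intrinsic:

```c
#include <stdint.h>

/* Returns the 1-based position of the least significant set bit,
 * or 0 when no bit is set (same convention as __builtin_ffs). */
int myapp_first_set_bit(uint32_t status)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Assumes a 32-bit int, which holds on Cortex-M. */
    return __builtin_ffs((int)status);
#else
    int index = 1;

    if (status == 0u) {
        return 0;
    }
    while ((status & 1u) == 0u) {
        status >>= 1;
        ++index;
    }
    return index;
#endif
}
```

A dispatcher can then use the returned position as an index into a table of handlers and clear that bit before looking for the next one.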

Using intrinsics is easier than assembly because they still look and behave like regular C functions. It is still worth knowing which data manipulation operations map efficiently to CPU instructions. I worked once on a sensor ASIC that had a stream of data delivered to the CPU from an ADC. For some arcane reason the digital designers highly preferred to deliver the data from the ADC FIFO "slightly" out of order if I could reorder it cheaply on the CPU side. In that case I found the REV16 Cortex-M instruction (and the __REV16() intrinsic) that could do the reordering in a single cycle and some silicon area was saved.
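For reference, this is what the two byte-reversal operations compute, written as portable C (the function names are illustrative). REV reverses all four bytes of a word, while REV16 swaps the two bytes inside each 16-bit half:

```c
#include <stdint.h>

/* Same result as REV: reverse the order of all four bytes. */
uint32_t swap_bytes_in_word(uint32_t x)
{
    return (x >> 24) |
           ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) |
           (x << 24);
}

/* Same result as REV16: swap the bytes within each halfword. */
uint32_t swap_bytes_in_halfwords(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}
```

For the input 0x11223344 the first function returns 0x44332211 and the second returns 0x22114433. An optimizing compiler may recognize these patterns and emit the single instruction anyway, but the intrinsic states the intent more clearly.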

Linker syntax & symbols

The three compilers obviously come with their own linkers. There is a lot of compatibility between GCC and Clang. Object code can be freely linked and linker script syntax is identical. The command line options for ld and lld are largely identical. IAR's ILINK is a different beast.

I would say that the linker script syntax is where IAR tools really shine. The syntax actually makes sense and is human-readable. Example (I removed the comments):

define symbol __ICFEDIT_intvec_start__ = 0x08000000;
define symbol __ICFEDIT_region_ROM_start__    = 0x08000000;
define symbol __ICFEDIT_region_ROM_end__      = 0x0801FFFF;
define symbol __ICFEDIT_region_RAM_start__    = 0x20000000;
define symbol __ICFEDIT_region_RAM_end__      = 0x20009FFF;

define symbol __ICFEDIT_size_cstack__ = 0x400;
define symbol __ICFEDIT_size_heap__   = 0x200;

define symbol __region_SRAM1_start__  = 0x20000000;
define symbol __region_SRAM1_end__    = 0x20007FFF;
define symbol __region_SRAM2_start__  = 0x20008000;
define symbol __region_SRAM2_end__    = 0x20009FFF;

define memory mem with size = 4G;
define region ROM_region      = mem:[from __ICFEDIT_region_ROM_start__   to __ICFEDIT_region_ROM_end__];
define region RAM_region      = mem:[from __ICFEDIT_region_RAM_start__   to __ICFEDIT_region_RAM_end__];
define region SRAM1_region    = mem:[from __region_SRAM1_start__   to __region_SRAM1_end__];
define region SRAM2_region    = mem:[from __region_SRAM2_start__   to __region_SRAM2_end__];

define block CSTACK    with alignment = 8, size = __ICFEDIT_size_cstack__   { };
define block HEAP      with alignment = 8, size = __ICFEDIT_size_heap__     { };

initialize by copy { readwrite };
do not initialize  { section .noinit };

place at address mem:__ICFEDIT_intvec_start__ { readonly section .intvec };

place in ROM_region   { readonly };
place in RAM_region   { readwrite,
                        block CSTACK, block HEAP };
place in SRAM1_region { };
place in SRAM2_region { };

What do we have here? Definitions of addresses, then some regions, magic names for the stack & heap, what goes into RAM, what goes into ROM (flash) and... that is it. Even without reading any ILINK documentation you can instantly understand 90% of this script. Now compare it with a basic GNU linker script that is almost 300 lines long.

Is it significant? Maybe (again). The projects I worked on did not require any elaborate linking schemes. I usually delivered a bootloader + bootloadable application combo so I did not have to hack linker scripts too much. There are some extra features provided by IAR. For example automatically adding checksums to the binary images. However, I prefer to have a Python script do the postprocessing explicitly (like adding checksum and magic values to the application so that it is recognized by the bootloader). To me the IAR linker syntax simply makes sense. The GNU syntax is too difficult to follow.

Startup code

There are surprisingly many things that have to happen before main() is reached. Startup code has many responsibilities. For example:

  • "Chip startup" eg. configuration of the clocks, busses and supplies
  • "CPU startup" eg. enabling the FPU
  • "Board startup" eg. enabling external RAM and bumping supply voltages to increase clock speeds
  • C runtime startup eg. initializing static variables and zeroing out memory, initializing the standard library

All these features are commonly called "the startup code" even though they deal with different things. Chip startup is provided by the chip vendor (eg. NXP). CPU startup comes from ARM (eg. CMSIS). Board startup depends on the exact PCB so most likely you have to write it yourself (or based on an existing demo). All these steps are mostly independent of the toolchain.

C runtime startup is the one that differs the most across toolchains. For example: in GCC the data initialization is done by simple loops (see __cmsis_start for an example), IAR has its own __iar_data_init3, newlib needs __libc_init_array and provides a _start() function, IAR has a __low_level_init etc. All these differences are mostly "cosmetic" unless you have a very complicated system with multiple memories, binary sections etc. The startup code is not portable (maybe only between GCC and Clang) but it is written only once, pretty small and usually included in the toolchain or demos from the chip supplier.
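The "simple loops" boil down to something like the sketch below. On a real target the five addresses come from linker-defined symbols whose names depend on the linker script; here they are passed as parameters (an assumption made for this example) so that the function can also be exercised on a PC:

```c
#include <stddef.h>
#include <stdint.h>

/* Copies initialized data from its load address (flash) to its run
 * address (RAM), then zeroes the .bss range. */
void myapp_runtime_init(const uint32_t *data_load,
                        uint32_t *data_start, uint32_t *data_end,
                        uint32_t *bss_start, uint32_t *bss_end)
{
    for (uint32_t *dst = data_start; dst < data_end; ++dst) {
        *dst = *data_load++;
    }
    for (uint32_t *dst = bss_start; dst < bss_end; ++dst) {
        *dst = 0u;
    }
}
```

Everything that runs before these loops finish must not rely on initialized or zeroed static variables.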

Analyzers & additional tools

What is great about GCC?

gcov

The GCC toolchain provides an extremely useful coverage analyzer called gcov. Code coverage tools tell you which lines of code have been reached and executed. As we all know, code that was not tested is almost certainly not correct. And code that was never run was certainly never tested. The end product (gcov's raw output is usually rendered by a front end such as lcov or gcovr) is an HTML report that has all of your project's source code highlighted in green (executed) or red (not executed) plus the percentage of code executed per file.

Code coverage with GCC works basically like this:

  1. You compile your code with --coverage and GCC adds special instructions that track every branch (this of course affects runtime performance).
  2. The binary has to be linked with libgcov and implement a way to save data to a file (the easiest way is probably semihosting).
  3. The project now has to be run on the actual target and exercised, preferably by a series of tests to have the code go into as many branches as possible.
  4. Once enough testing is done the code has to save coverage data to a file.
  5. The coverage file is postprocessed to an HTML report (for example with lcov or gcovr).

There are of course commercial coverage tools that may be easier to set up but gcov's main advantage is that it is simply part of GCC.

Static analyzer

GCC has had an integrated static code analyzer for a while now. It is enabled by adding -fanalyzer to compiler flags. You can think of this feature as simply building your code (but without producing any output files) and hoping to not get any build warnings.

GNU complexity

GNU complexity is a tool that analyzes code for cyclomatic complexity. In simple terms: the more branches (and control flow statements like break, continue, return) there are in the code the more difficult it is to understand (and to make sure that it does what is intended). A single if statement has two branches so you need at least two test cases to exercise them both. If a function has four return points you need at least four test cases. What if there are 50 or 100 or more possible code paths? You are in trouble. You are deeply in trouble. The code is untestable and unmaintainable. Having a code complexity analyzer allows you to quickly find code spots that need more care and refactoring. The sooner you spot them the easier the refactoring.
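To make the counting concrete, here is a made-up function with four return points (three decisions, so a cyclomatic complexity of four). The frame format and names are invented for this illustration:

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    FRAME_OK,
    FRAME_NULL,
    FRAME_TOO_SHORT,
    FRAME_BAD_HEADER
} frame_status_t;

/* Three decisions and four return points: at least four test cases
 * are needed to reach every exit. */
frame_status_t check_frame(const uint8_t *frame, size_t len)
{
    if (frame == NULL) {
        return FRAME_NULL;
    }
    if (len < 2u) {
        return FRAME_TOO_SHORT;
    }
    if (frame[0] != 0xA5u) {
        return FRAME_BAD_HEADER;
    }
    return FRAME_OK;
}
```

The four matching test inputs are: a NULL pointer, a one-byte frame, a frame with a wrong first byte and a valid frame. Every additional condition adds at least one more case to that list.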

What is great about Clang?

Apart from extra warnings (and fewer language extensions than GCC) Clang has a very useful code analyzer called Clang-Tidy. The output is an HTML report. The analyzer covers different "spots" than the compiler. I did not find many issues in my code after first analysis but some of them were clearly valid (or the code was poorly written). There were some false-positives but they can easily be silenced using magic comments. Running Clang-Tidy is as easy as replacing clang with clang-tidy in the Makefiles (+some small flag adjustments).

What is great about IAR?

The tool I like the most from the IAR ecosystem is C-STAT. It is a static code analyzer (like Clang-Tidy) that includes MISRA checking. Some customers require you to provide code that is MISRA-clean (or document and justify all deviations). Due to various licensing & copyright shenanigans there are no open-source tools that can check your code for MISRA violations. I researched multiple MISRA tools and found IAR C-STAT to be the "least bad" (I could not get anything useful done with PC-Lint) because it is the only tool you can actually buy when you have a small team. Other vendors do not even give a price when they hear that your development team is less than 15 people.

C-STAT is very easy to add when you already have a build setup with IAR. Of course the whole codebase has to build cleanly with IAR first. The output is an HTML report that points to all issues in your codebase.

Conclusion (and why you should support multiple toolchains)

In my opinion the three Cortex-M compilers perform almost identically in 95% of portable C use cases. There are some small advantages and disadvantages that may be relevant depending on the exact project requirements. Binaries may have better or worse performance depending on your exact code so you should always profile your firmware if performance is important. To deliver really high quality firmware I simply recommend using a mix of tools:

  • GCC builds for "general use"
  • GCC builds with gcov
  • GCC -fanalyzer for static analysis
  • GNU complexity
  • Clang builds for testing portability and more warnings
  • clang-tidy for static analysis
  • IAR builds for testing portability and more warnings
  • IAR C-STAT for static analysis and MISRA checking

Does it look like overkill? With a proper CI setup the cost of supporting all the tools is next to nothing. Every developer can pick their favourite toolchain for daily work and CI takes care of the rest. Of course it is best to "test as you fly, fly as you test" so ideally you would have hardware-in-the-loop tests for all builds. Once you decide which toolchain is going to be used for the release builds it is best to concentrate most of development & testing effort on that particular build so you don't end up developing something for 2 months with GCC and hoping that everything will turn out perfect when built with IAR in the last week before delivery.

The biggest general "cost" I see is making a test suite for hardware-in-the-loop testing. This cost is totally independent of the selected toolchain. Adding one more binary from another toolchain to an existing test suite is practically free.

If your firmware is built with more than one toolchain you can also almost instantly rule out or confirm compiler bugs. I think everybody was at least once in a situation where a trivial code change in one place triggered some totally unexplainable behaviour in a different place. If it happens on two different binaries then you know that the issue is more likely in your code (or silicon) than in the compiler. Different compilers may also help uncover obscure implementation-defined behaviour that you were not aware of (say hello to missing volatile here and there). Undefined behaviour should of course be avoided at all cost by enabling compiler warnings.
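The classic "missing volatile" case is a flag shared between an interrupt handler and the main loop. In the sketch below the ISR is an ordinary function with a made-up name so that it can be called from a test. Without the volatile qualifiers a compiler is allowed to read rx_ready once and keep the value in a register, and whether it does so depends on the compiler and optimization level:

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared between interrupt context and the main loop. */
static volatile bool rx_ready = false;
static volatile uint8_t rx_byte;

/* Stand-in for a UART receive interrupt handler (hypothetical name). */
void uart_rx_isr_sketch(uint8_t received)
{
    rx_byte = received;
    rx_ready = true;
}

/* Called from the main loop; returns true when a byte was available. */
bool try_read_byte(uint8_t *out)
{
    if (!rx_ready) {
        return false;
    }
    *out = rx_byte;
    rx_ready = false;
    return true;
}
```

Code like this may appear to work for years without volatile and then break after a compiler upgrade or a switch to another toolchain, which is exactly the kind of issue a second build catches early.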

Every time I get a failure notification from the CI server I feel a teeny-tiny spark of satisfaction that "the process is working" and that the safety net caught something automatically that could have surfaced much later and in worse circumstances.