Fast integer scaling on Cortex-M0

Integer scaling is very common in embedded systems. For example to make human-readable values of ADC readings (correcting for gain or voltage dividers). Two operations are needed to do scaling without using floating-point numbers – multiplication and division. Cortex-M0 is a nice little beast with a single cycle integer multiplier. ARM provides a smaller area 32-cycle multiplier but I have not yet come across a chip that would have it. All chip makers seem to choose the faster option. Cortex-M0 however totally lacks integer division. Division is handled by library functions. A simple code like y = x * 123 / 120; gets compiled to a function call. Software division of course takes time (and energy). There is a nice hack though 🙂

Division by arbitrary numbers is not supported, however the CPU can do single-cycle binary shifts using a circuit called a barrel shifter that has always been present in ARM CPUs. Binary shifting by 1 left is equivalent to multiplication by 2 and shifting by 1 right is equivalent to division by 2.

The hack

Instead of carrying out the scaling (eg. 1.23) by multiplying by an arbitrary integer (eg. 123) and dividing by an arbitrary integer (eg. 100) – pick any number that is a power of 2, use it as the denominator and find a numerator that gives the same scaling ratio.

Examples

Ratio Numerator Denominator
1.23 123 100
1.2266 157 128
1.2305 630 512
0.76 76 100
0.7656 49 64
0.7598 778 1024

Quirks

As you can see from the examples above – the total error is very small (should not matter for noisy ADC samples). The error becomes smaller with larger denominators (more resolution). You have to be aware of the limits of Cortex-M0 multiplier. It is a 32 x 32 -> 32 multiplier. This means that it can multiply any two 32-bit numbers but will return only lower 32 bits of the result. For example: you can freely multiply a uint8_t by an uint16_t, 65535 by 65536, but not 65536 by 65536 (the result would overflow).

Working example: let’s say that you want to scale a 10-bit sample from an ADC by 1.8. The maximum code from the ADC is 1023. If you pick a denominator of 1024 the numerator becomes 1843. 1023 multiplied by 1843 is 1885389 (top bit is 20). The result of the multiplication does not overflow, everything is fine.

Non-working example: scaling a 16-bit ADC sample by 15.4. Maximum ADC code is 65535. Denominator is 8192, numerator is 126157. Maximum ADC code multiplied by the numerator is 8267698995. The result is 33-bits wide which means that the operation overflowed.

Implementation

The code has volatiles to prevent it being optimized out. It is not exactly correct to do operations on an uninitialized variable 🙂

You can see that purely decimal scaling basically compiles to two __aeabi_uidiv function calls (and the function probably runs some kind of an iterative algorithm to do the division). MULS is a single-cycle multiplication instruction. Let’s skip the other instructions for now.

Division code with a divisor that is a power of two omits the library calls and simply compiles to single-cycle LSRS (shift) instructions. The real-world code can be much shorter without volatiles. I used volatile variable just as an example.

The divisor must be a compile-time constant! Otherwise the compiler will not know that it is a power of division that can be executed as binary shifts.

Screenshots taken from Ozone. Literally THE BEST Cortex-M debugger.