Integer scaling is very common in embedded systems. For example to make human-readable values of ADC readings (correcting for gain or voltage dividers). Two operations are needed to do scaling without using floating-point numbers – multiplication and division. Cortex-M0 is a nice little beast with a single cycle integer multiplier. ARM provides a smaller area 32-cycle multiplier but I have not yet come across a chip that would have it. All chip makers seem to choose the faster option. Cortex-M0 however totally lacks integer division. Division is handled by library functions. A simple code like `y = x * 123 / 120;` gets compiled to a function call. Software division of course takes time (and energy). There is a nice hack though ðŸ™‚

Division by arbitrary numbers is not supported, however the CPU can do single-cycle binary shifts using a circuit called a barrel shifter that has always been present in ARM CPUs. Binary shifting by 1 left is equivalent to multiplication by 2 and **shifting by 1 right is equivalent to division by 2**.

# The hack

Instead of carrying out the scaling (eg. 1.23) by multiplying by an arbitrary integer (eg. 123) and dividing by an arbitrary integer (eg. 100) – **pick any number that is a power of 2, use it as the denominator and find a numerator that gives the same scaling ratio**.

## Examples

Ratio | Numerator | Denominator |

1.23 | 123 | 100 |

1.2266 |
157 | 128 |

1.2305 |
630 | 512 |

0.76 | 76 | 100 |

0.7656 |
49 | 64 |

0.7598 |
778 | 1024 |

## Quirks

As you can see from the examples above – the total error is very small (should not matter for noisy ADC samples). **The error becomes smaller with larger denominators** (more resolution). You have to be aware of the limits of Cortex-M0 multiplier. It is a 32 x 32 -> 32 multiplier. This means that it can multiply any two 32-bit numbers but will return only lower 32 bits of the result. For example: you can freely multiply a `uint8_t` by an `uint16_t`, 65535 by 65536, but not 65536 by 65536 (the result would overflow).

Working example: let’s say that you want to scale a 10-bit sample from an ADC by 1.8. The maximum code from the ADC is 1023. If you pick a denominator of 1024 the numerator becomes 1843. 1023 multiplied by 1843 is 1885389 (top bit is 20). The result of the multiplication does not overflow, everything is fine.

Non-working example: scaling a 16-bit ADC sample by 15.4. Maximum ADC code is 65535. Denominator is 8192, numerator is 126157. Maximum ADC code multiplied by the numerator is 8267698995. The result is 33-bits wide which means that the operation overflowed.

# Implementation

The code has volatiles to prevent it being optimized out. It is not exactly correct to do operations on an uninitialized variable ðŸ™‚

You can see that purely decimal scaling basically compiles to two `__aeabi_uidiv` function calls (and the function probably runs some kind of an iterative algorithm to do the division). `MULS` is a single-cycle multiplication instruction. Let’s skip the other instructions for now.

Division code with a divisor that is a power of two omits the library calls and simply compiles to single-cycle LSRS (shift) instructions. The real-world code can be much shorter without volatiles. I used `volatile` variable just as an example.

**The divisor must be a compile-time constant!** Otherwise the compiler will not know that it is a power of division that can be executed as binary shifts.

Screenshots taken from Ozone. Literally THE BEST Cortex-M debugger.