Herman Roebbers, Trainer, 19 October 2018
Herman Roebbers is an advanced expert at Altran and has been working on embedded systems and parallel arithmetic since the mid-1980s. He is also an external advisor to the EEMBC working groups Ulpbench, Iotconnect and Securemark, and gives the workshop “Ultra-low power for the Internet of Things.“
In the pursuit of battery-less IoT devices, it is important to use energy as efficiently as possible. By utilizing an encryption library, Herman Roebbers shows how small tweaks to the tooling and chip settings alone can have a huge impact on consumption.
How can I reduce the energy consumption of my IoT system? This question is becoming more and more relevant as we continue to increase our expectations of IoT devices. Ultimately, the goal is that systems require so little energy, they can harvest it from their environment and no longer need batteries.
To achieve this, we need to work in two directions, increasing harvest yields and reducing consumption. The first is being addressed: new materials and methods to make and remanufacture solar cells produce ever-higher yields. Success is also being made in the field of RF energy harvesting. The Delft startup Nowi, for example, has made special chips that are very good at this. Furthermore, a lot of research is being done on new materials to convert temperature differences into energy more efficiently. We are also working hard on increasingly efficient converters that convert harvested energy into required voltage(s) and ensure efficient energy storage, for example, in rechargeable batteries or supercapacitors.
A case study
An earlier Bits&Chips article gave an overview of all aspects that are important to save energy: from chip substrate, transistor selection, processor architecture and the circuit board to driver, ox, encoding tools and ways up to the application. In the meantime, the table has expanded.
A recent case illustrates the effect of different mechanisms on energy. EEMBC just released a benchmark to determine the energy needed for several typical tls (transport layer security) operations. Tls is part of an https implementation and, as such, is essential for setting up a secure connection. The benchmark has been ported to an evaluation board that supports cryptography through the Arm Mbedtls library.
We can use that process to show what each optimization step delivers. For this purpose, we first perform a baseline measurement each time. The benchmarking framework uses an energy monitor from Stmicroelectronics and an Arduino Uno. The Arduino is used as a uart interface towards the device under test (dut, Figure 1).
Figure 1: The setup for measuring power consumption
We also use the development environment Atollic Truestudio 9.0.1 for STM32, which uses a proprietary version of the GCC compiler, as well as the Stm32cubemx software, which can generate (initialization) code for peripherals and thus considerably simplifies configuration.
Step 1: Look at the compiler settings
If we do a baseline measurement with a non-optimized version (setting -O0) at 80 MHz (highest speed) and 3.0 volts, this results in a Securemark of 505. If we change the optimizer setting to -O1, this makes a huge difference: we’re going to 1336! The optimizer settings for -Og and -O2 don’t make much difference, but if we go to -O3 or -Ofast, things will go even better: 1490.
This demonstrates what you can achieve with the compiler settings alone. The ideal settings, however, can differ per function. In our case, for example, there is no difference between -O3 and -Ofast, but this is not always the case. So, it may pay off to choose the settings per function or per file separately.
With the compiler settings -O2, -O3 and -Ofast, programming errors may appear that do not occur with other settings. Timing can change, and it is necessary to qualify variables that are used in multiple contexts (e.g. normal and interrupt context) as volatile to avoid problems.
Step 2: Look at the pll
Microcontrollers nowadays have very extensive settings for all kinds of clock signals on the chip. One of those settings concerns the frequency multiplier (pll). This can be used to multiply and divide a low frequency to create all kinds of other clock speeds.
In our case, the frequency of the internal oscillator is 16 MHz. To make it 80 MHz, we cannot simply multiply it by five, unfortunately. We have a choice of two settings: the first is to divide by 1, multiply by 10 and then divide by 2. The second option is to divide by 2, multiply by 20 and then divide by 2 again.
That gives different scores: 1462 against 1490. The result in both cases is 80 MHz, but the second method is two percent more economical. The lower the clock frequencies, the less energy you lose, and the sooner you divide the clock frequency, the better.
If you have enough time at your disposal, you can also use the processor without pll, because that’s actually quite an energy guzzler. With the built-in oscillator, we can generate a maximum frequency of 48 MHz, which results in a 4 percent higher score. The disadvantage is that it takes a bit longer: 80/48 = 1.66 times longer to be precise.
The Nucleo L4A6ZG development board from STMicroelectronics offers a lot of tools to optimize energy use.
Step 3: Turn off unnecessary clocks
Now that we have explored a few things, we can choose a setting and further optimize from there. We start quite conservatively: -O1 and a frequency of 80 MHz via our second pll setting. This brings our Securemark score to 1336.
The next step is to turn off all superfluous clocks. In our case, the clock to the uart and all i/o ports can be turned off. This saves between 2.5 and 2.9 mW and gives a score of 1448. This costs 1448/1336 = 1.08 times less energy (8 percent gain).
Step 4: Optimize the memcpy function
During the execution of cryptographic functions, the memcpy function is regularly used. Opting for an optimized version yields a five percent profit in the case of GCC. The IAR compiler already provides an optimized version. This allows us to increase our score to 1524. Profit: 5 percent.
Step 5: Tighten the thumbscrews
Now we can see if we can also do it with a low clock frequency. With that, we could lower the core voltage. For our mcu, this core voltage should be 1.2 V for frequencies above 26 MHz. For simplicity, we take 24 MHz, a standard frequency in the menu of the msi oscillator, where the pll can remain off – another 4 percent gain: 1588.
We can also test whether we can safely set the compiler optimizations a bit more aggressively. If we go to setting -O2, we arrive at a score of 1691 – another 6.4 percent gain.
Step 6: Reduce tensions
We have already prepared the clock frequency for a lower core voltage. Now we are actually going to set it. The result is beautiful: 2021, almost 20 percent profit!
The power supply voltage can also be a bit lower. We started at 3.0 V, but if we go to 2.4 V, that again gives an improvement of 26 percent. We can go even further to 1.8 V if necessary. We haven’t done that here, but if we extrapolate, we can expect a further saving of a third.
With a few simple measures, energy consumption can already be drastically reduced. In our case study, a factor of five to seven is easily achievable compared to a non-optimized version.
'An embedded system without batteries is within reach.'
However, I have limited myself here to tooling and chip settings. With additional measures in other areas, tens of percent extra improvement can be achieved. Therefore, an embedded system without batteries is within reach.