The MAX32655: Why Two Cores Are Better Than One
Two cores are better than one! Sometimes, two different cores are better than two identical cores! This application note explains why.
When most people think of multicore processors, their minds run immediately to personal computers, smartphones, or other high-level computing devices. These devices almost always have a sophisticated user interface and a preemptive multitasking operating system.
When we read the reviews for a new PC or a new smartphone, one of the things usually mentioned is the number of cores: 'The CPU in this device has four cores that support up to eight threads.' Armed with that information, we know that the operating system can dispatch tasks to any of the cores to try to balance the overall workload.
All the computing cores are as nearly identical as possible in this sort of environment. After all, it would not make sense to have to compile every task separately for different CPU architectures.
But for deeply embedded processors, having lots of cores of the same type may not be as desirable as having multiple, specialized cores.
The case for differentiated cores
First, in many deeply embedded applications, there may not even be an operating system. It is up to the designer to decide what tasks run on which cores, when, and for how long. Second, it is very likely that the main task will run in a single thread and any additional cores will be there to run particular, specialized tasks.
In these cases, where one CPU handles all the supervisory tasks and other, subordinate cores handle specialized tasks, it hardly matters whether the cores are all the same. It makes more sense to have specialized cores that offer more capability in a particular area, use less power, or consume less silicon area on the die.
Consider, for example, the MAX32655. It is a microcontroller based on the Arm Cortex-M4, a powerful, well-regarded, single-threaded CPU core. The MAX32655 comes with a half-megabyte of flash, 128KB of RAM, and lots of peripherals, including a Bluetooth® Low Energy radio.
Now, stop right there. Bluetooth Low Energy may sound like a hardware feature, but it has as much to do with software as hardware. You see, the hardware is actually fairly simple: the transmitter and receiver have to be somewhat frequency agile in the sense that they must tune to any one of forty possible channels, and they have to do it rapidly. Both the transmitter and receiver switch on and off quickly to conserve power. And they have to handle Gaussian frequency shift keying modulation and demodulation. Not trivial, but not particularly challenging either.
But the software? That can be the real challenge! The host has to manage a whole set of protocols: the Attribute Protocol, the Logical Link Control and Adaptation Protocol, the Security Manager Protocol, and others. And there is an interface, the Host Controller Interface, that connects the host to the controller layer. It is the controller layer that manages the radio hardware, and it is not trivial either. It contains a device manager, link manager, and link controller, functional blocks that abstract the particulars of the radio hardware up to the host protocols.
The good news about the BLE software stack is you do not have to write it. In most environments, the software stack comes along when you specify a particular BLE device. The bad news is that the BLE software stack is demanding. It is not that it takes a particularly large swath of processor bandwidth or storage, but timing in the BLE stack is critical. That means when BLE needs something done, it needs to be done now! And if the microcontroller core is already performing some time-critical function, something has to give!
This is where a second core comes in. The primary core is freed to devote its resources to the user program because it can hand off the BLE tasks to a second core. There are no latency concerns, no worries that the BLE stack will interrupt the core at just the wrong moment. That second, independent core solves a lot of problems!
Choosing the right core
A second core makes sense for running the BLE stack. But which core to choose?
One option is to go with a second Arm Cortex-M4. The M4 brings a lot of resources to the table. It is native to the AMBA bus, and the Bluetooth libraries all compile very nicely for the Cortex-M series of cores. So, why not a second M4?
The truth is, as efficient as the Arm Cortex-M4 is, it is more CPU than required for the specialized task of running the BLE stack. A smaller, slower, more modest CPU will do nicely. But which CPU?
Say hello to the RISC-V (pronounced risk-five) core! Like the Arm Cortex-M4, it is a 32-bit CPU based on a RISC architecture. RISC, or Reduced Instruction Set Computing, is a philosophy of computer architecture that asserts that a small set of simple instructions results in more efficient execution than a larger set of more complex instructions. RISC has largely won the argument over CISC (Complex Instruction Set Computing). But there is still one huge redoubt of CISC computing: the x86 architecture. Most other computing environments, and just about all embedded processors, have embraced RISC.
The first three things that jump out when evaluating a new CPU core are the instruction set architecture, register complement, and programmer's model. The next section explores how those factors differ between the RISC-V and the Arm Cortex-M4.
Comparing the instruction sets
Before looking at the instruction sets, remember: high-level languages like C are not the computer's native tongue. The C compiler converts the program to a set of machine instructions to be executed by the core. A single C statement may translate to dozens of machine instructions. C compilers have become so efficient that assembly language programming, where one statement translates to exactly one machine instruction, has become something of a lost art.
The Arm Cortex-M4 is based on the ARMv7E-M instruction set architecture. And while the Arm Cortex-M4 is a 32-bit CPU, the ARMv7E-M instruction set uses Thumb-2 instruction encoding, and mostly, these are 16-bit instructions. There is a story here.
Originally, the Arm instruction set was 32-bit. Even though the underlying architecture is RISC, the instruction set was rich, but not particularly efficient in terms of code size. Arm came up with Thumb encoding in the mid-1990s to address these concerns about code size. Thumb instructions were 16 bits in length and corresponded pretty closely with a subset of the previously used 32-bit encoding. The resulting code was more compact, but many features were left behind.
Thumb-2, introduced in 2003, uses a mix of 16- and 32-bit encodings, and keeps most of the code density advantages while restoring some 32-bit encoding features. Of course, there is one disadvantage to this scheme: it results in a variable-length instruction set. But today, the Thumb-2 encoding is well-established and well-understood, and in fact is the only instruction encoding supported by the Arm Cortex-M4 core.
Most RISC-V implementations, in contrast, including the one used in Maxim's MAX32655, use pure 32-bit encoding. RISC-V supports optional compressed encoding, its term for 16-bit instruction support. But unlike the Thumb instruction encoding in the Arm Cortex-M4, it is truly optional.
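Both schemes let a decoder work out an instruction's length from its first 16 bits. The following is a minimal Python sketch of that check. The bit patterns come from the Arm and RISC-V architecture manuals rather than from this note, the function names are illustrative, and the RISC-V function deliberately ignores the reserved encodings longer than 32 bits.

```python
def thumb2_length(first_halfword):
    """Length in bytes of a Thumb-2 instruction, given its first halfword.

    A halfword whose top five bits are 0b11101, 0b11110, or 0b11111 is the
    first half of a 32-bit instruction; anything else is a complete
    16-bit instruction.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def riscv_length(first_halfword):
    """Length in bytes of a RISC-V instruction, given its low 16 bits.

    Standard 32-bit instructions have 0b11 in the two lowest bits; any
    other value marks a 16-bit compressed instruction. Encodings longer
    than 32 bits are not handled in this sketch.
    """
    return 4 if (first_halfword & 0b11) == 0b11 else 2
```

On a core built without the compressed extension, the RISC-V check always returns 4, which is what makes a fixed-length fetch and decode path possible.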
But what instructions are there?
Both the ARMv7E-M and RISC-V instruction sets are load-store architectures. That means operands for arithmetic and logical instructions must be explicitly loaded from memory into registers, and once the operation is performed, the result must be explicitly stored back to memory.
Load-store architecture is something of a hallmark of RISC cores, but beyond that, there is not much 'reduced' about the ARMv7E-M instruction set. The Arm instruction set contains dozens of instructions and instruction variants, with conditional execution, shifting as an instruction operand, lots of bit and byte manipulation instructions, and several memory addressing modes. Add it all up, and it makes an Arm core a pretty complex piece of logic.
Not so for the RISC-V. There are just forty base instructions in the RV32I instruction set. And as many of these can be grouped into instruction families, the instruction set becomes even more comprehensible: there are six branch instructions, five load instructions, three store instructions, and two 'jump and link' instructions used for function calls. Treat those instruction families together, and there are only 28 instructions to think about.
To make things even simpler, there are only six encoding formats: register to register, immediate, upper immediate, store, branch, and jump. And that is it! The code generation piece of a RISC-V compiler has an easy time of it.
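To give a feel for how little decoding is involved, here is a small Python sketch that maps the 7-bit opcode field of an RV32I instruction word to its encoding format. The opcode values are taken from the RISC-V base ISA specification, not from this note, and the table covers only the main RV32I opcode groups. The names FORMAT_BY_OPCODE and encoding_format are illustrative.

```python
# Major opcodes (bits 6..0) of the main RV32I instruction groups and the
# encoding format each one uses.
FORMAT_BY_OPCODE = {
    0b0110011: "R",  # register to register: ADD, SUB, AND, ...
    0b0010011: "I",  # immediate arithmetic: ADDI, ANDI, ...
    0b0000011: "I",  # loads
    0b1100111: "I",  # JALR
    0b0100011: "S",  # stores
    0b1100011: "B",  # conditional branches
    0b0110111: "U",  # LUI (upper immediate)
    0b0010111: "U",  # AUIPC (upper immediate)
    0b1101111: "J",  # JAL
}


def encoding_format(word):
    """Return the format letter for a 32-bit instruction word, or '?'."""
    return FORMAT_BY_OPCODE.get(word & 0x7F, "?")


def register_fields(word):
    """Extract rd, rs1, and rs2, which sit at the same bit positions in
    every format that uses them."""
    return {
        "rd": (word >> 7) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "rs2": (word >> 20) & 0x1F,
    }
```

For example, the word 0x003100B3 (ADD x1, x2, x3) decodes as format R with rd = 1, rs1 = 2, and rs2 = 3. Keeping the register fields in fixed positions across formats is one reason the decode logic stays small.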
Look at all those registers!
Now, register complement: the ARMv7E-M architecture specifies 16 registers, of which 13 are general-purpose. The RISC-V architecture specifies 32 registers, of which all but one are general-purpose. Here is how they work.
Arm registers are named R0 to R15. But the top three registers have special purposes: R15 is the program counter, and any write to R15 is effectively a jump to that location. R14 is the link register. When a branch-and-link (a function call) is executed, the return address is stored in R14. R13 is the stack pointer. PUSH instructions decrement R13 and store the values in the specified registers to descending memory locations; POP instructions load values from ascending memory locations and increment R13.
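The PUSH and POP behavior described above can be modeled in a few lines. This is a minimal Python sketch, not a description of any real tool: the class name ArmStack and the dictionary standing in for memory are assumptions made for illustration. It follows the Arm convention that the lowest-numbered register in the list ends up at the lowest address.

```python
class ArmStack:
    """Toy model of the Arm full-descending stack addressed through R13."""

    def __init__(self, initial_sp):
        self.sp = initial_sp  # stands in for R13
        self.mem = {}         # word address -> 32-bit value

    def push(self, regs):
        """PUSH {reglist}: regs maps register number to value."""
        self.sp -= 4 * len(regs)          # SP moves down first
        addr = self.sp
        for number in sorted(regs):       # lowest register, lowest address
            self.mem[addr] = regs[number]
            addr += 4

    def pop(self, reg_numbers):
        """POP {reglist}: returns a dict of register number to value."""
        values = {}
        addr = self.sp
        for number in sorted(reg_numbers):
            values[number] = self.mem[addr]
            addr += 4
        self.sp = addr                    # SP ends above the popped words
        return values
```

Pushing R4, R5, and R14 with the stack pointer at 0x20001000 leaves it at 0x20000FF4, and popping the same list restores both the values and the original stack pointer.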
RISC-V registers are named x0 to x31 and there is only one user-visible special purpose register: x0 always returns a zero value when read and anything written to x0 is discarded. This is a real boon to simplifying instruction set encoding.
All the other registers are truly general-purpose. By convention, x1 is used as the return address (what Arm would call the link register) and x2 is used as the stack pointer, but there is nothing in the hardware to enforce that (unless compressed extensions have been implemented in the core design, but that is another story). While the Arm BL (Branch and Link) instruction always stores the return address in R14, the RISC-V JAL (Jump and Link) instruction lets you specify the register that gets the return address.
Some instructions seem to be missing from the RISC-V instruction set, such as a register-to-register move operation! Instead, the contents of one register are moved to another using the ADDI instruction: ADDI rd, rs, 0 adds the contents of rs to the value zero and stores the result in rd, effectively moving whatever was in rs to rd. Neat!
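As an illustration, the following Python sketch builds the 32-bit machine word for ADDI and uses it to form a register move. The field layout is the I-type format from the RISC-V base ISA specification. The function names encode_addi and encode_mv are made up for this example.

```python
OPCODE_OP_IMM = 0b0010011  # major opcode shared by ADDI, ANDI, ORI, ...
FUNCT3_ADDI = 0b000


def encode_addi(rd, rs1, imm):
    """Encode ADDI rd, rs1, imm as a 32-bit I-type instruction word."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register numbers must be 0..31")
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in 12 signed bits")
    return (
        ((imm & 0xFFF) << 20)   # imm[11:0] in bits 31..20
        | (rs1 << 15)           # source register in bits 19..15
        | (FUNCT3_ADDI << 12)   # funct3 in bits 14..12
        | (rd << 7)             # destination register in bits 11..7
        | OPCODE_OP_IMM         # opcode in bits 6..0
    )


def encode_mv(rd, rs):
    """A register move is just ADDI with a zero immediate."""
    return encode_addi(rd, rs, 0)
```

Here encode_mv(5, 6) yields 0x00030293, and encode_addi(0, 0, 0) yields 0x00000013, the word RISC-V uses as its NOP. Assemblers typically accept mv and nop as pseudo-instructions and emit exactly these ADDI encodings.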
You may think that ADDI is a poor substitute for MOVE because arithmetic instructions modify the flags register, and MOVE does not. In the RISC-V core, that's not a concern because there is no flags register! Instead, the branch instructions—BEQ (Branch if Equal), BNE (Branch if Not Equal), BLT (Branch if Less Than), BGE (Branch if Greater than or Equal), BLTU (Branch if Less Than, Unsigned) and BGEU (Branch if Greater than or Equal, Unsigned)—all accept two register specifiers, perform the comparison, and then conditionally take the branch.
But how to branch if the result is zero? Just compare it against x0: BEQ rs, x0, label branches to label whenever rs holds zero.
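The compare-and-branch behavior, together with the hardwired x0, can be sketched in Python as follows. This is a toy model for illustration only: RegisterFile and branch_taken are hypothetical names, and the model only decides whether a branch is taken. It does not compute target addresses.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


class RegisterFile:
    """32 registers; x0 reads as zero and ignores writes."""

    def __init__(self):
        self._x = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._x[n]

    def write(self, n, value):
        if n != 0:  # writes to x0 are discarded
            self._x[n] = value & MASK32


# Each branch compares two register values directly; no flags are involved.
BRANCH_CONDITIONS = {
    "BEQ": lambda a, b: a == b,
    "BNE": lambda a, b: a != b,
    "BLT": lambda a, b: to_signed(a) < to_signed(b),
    "BGE": lambda a, b: to_signed(a) >= to_signed(b),
    "BLTU": lambda a, b: a < b,
    "BGEU": lambda a, b: a >= b,
}


def branch_taken(regs, mnemonic, rs1, rs2):
    """Return True if the named branch would be taken."""
    return BRANCH_CONDITIONS[mnemonic](regs.read(rs1), regs.read(rs2))
```

With x5 holding 0xFFFFFFFF, BLT x5, x0 is taken because the value is -1 when read as signed, while BLTU x5, x0 is not taken because the same bits are the largest unsigned value.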
So, while the instruction set and programmer's model for RISC-V may look a little unfamiliar to those used to the Arm instruction set, it really is powerful and complete.
How do these cores stack up?
So, now we come to the big question: why use a RISC-V core to run the BLE stack?
The answer lies in the customizability. All commercial CPU cores allow some degree of customization, and that goes for the Arm Cortex-M series as well. Generally, though, the customization options afforded in the Arm cores are large-grained. Does it have floating-point arithmetic, for example, or embedded trace?
The RISC-V core, in contrast, has a whole set of extensions that the chip designer can enable or disable: three kinds of floating-point units (single, double, and quad precision), atomic instructions, integer multiply-divide, compressed instructions to save code space, and others whose specifications are not yet frozen.
When designing a core to run the BLE controller stack, the RISC-V core can be built with just the integer multiply and divide extension. There is no need to include the other features, because they add nothing to running the BLE stack. So, there is no reason to occupy the silicon space and burn power supporting these features. We can also clock the core more slowly than the main Arm Cortex-M4, which can further reduce its power consumption.
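RISC-V configurations are conventionally named with a short string such as rv32im, where each letter after the register width stands for one extension. The Python sketch below parses a simplified form of such a string. The string rv32im is an illustration that matches the configuration described above, not a quoted specification of the MAX32655, and the parser ignores the RV32E base and multi-letter extension names.

```python
# Single-letter extensions mentioned in the text, keyed by their letter.
KNOWN_EXTENSIONS = {
    "i": "base integer instructions",
    "m": "integer multiply and divide",
    "a": "atomic instructions",
    "f": "single-precision floating point",
    "d": "double-precision floating point",
    "q": "quad-precision floating point",
    "c": "compressed (16-bit) instructions",
}


def parse_isa(isa):
    """Return the set of extension letters in a simplified RV32 ISA string."""
    isa = isa.lower()
    if not isa.startswith("rv32i"):
        raise ValueError("expected a string starting with 'rv32i'")
    letters = isa[4:]
    unknown = [c for c in letters if c not in KNOWN_EXTENSIONS]
    if unknown:
        raise ValueError("unknown extension letters: " + "".join(unknown))
    return set(letters)
```

Calling parse_isa("rv32im") returns the set containing "i" and "m", which makes it easy to confirm that floating-point, atomic, and compressed support are all absent from the build.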
That is why knowing the purpose of a second core in advance helps the silicon designer and helps to create the best microcontroller experience!