Multicore MCUs Offer New Embedded Options

By Kristin Lewotsky

Contributed By Electronic Products

2011-10-26

Embedded systems designers face constantly increasing demands for more performance and faster time-to-market. Embedded processors need to perform an expanding repertoire of tasks, often in real time. Meanwhile, applications demand high throughput and energy efficiency coupled with small form factors and low cost. Multicore microcontroller units (MCUs) provide a viable new solution, leveraging modular design to deliver multifold performance increases at an economical price.

For decades, as the number of transistors on an IC rose, chip performance kept pace. Ever more sophisticated architectures featuring techniques such as caching and pipelining allowed chip designers to use the increasing density of the silicon to continually boost processing speed. That is no longer the case. Chip designers have largely exhausted the gains available from these single-core architectural techniques, and performance improvements have dropped from keeping pace with Moore's Law to less than half that rate. The most practical way to keep increasing performance today is to leverage modularity by using multiple CPUs. That has led to the development of multicore MCUs.

Hardware: homogeneous versus heterogeneous

We define a multicore MCU as a device featuring two or more CPUs that coherently share a common memory. In the multicore architecture, each processor executes its own instruction stream on its own data stream, an arrangement known as multiple instruction, multiple data (MIMD).

Multicore MCUs can be classed as homogeneous or heterogeneous. As the name suggests, homogeneous devices feature two or more identical CPUs that can run operations in parallel or redundantly. Designed for safety applications, the Hercules line from Texas Instruments boasts dual ARM Cortex-R4F CPUs running in lockstep. Both CPUs perform the same operation and the results are compared each clock cycle, effectively establishing a "safe island" that provides designers with a reliable foundation to implement more sophisticated operations in medical, industrial, and automotive applications. To remove potential common failure modes, the design team oriented the two cores at 90° to one another and introduced a delay between the timing of the processors. The chip can operate at up to 200 MHz and sports 32 Mbytes of flash memory.
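
The lockstep idea can be modeled in a few lines of software. The sketch below is only an illustration of the compare-and-flag principle, not how the Hercules hardware works internally: real lockstep logic compares CPU outputs every clock cycle in silicon. The names `run_in_lockstep`, `LockstepError`, `scale`, and `faulty_scale` are hypothetical.

```python
class LockstepError(RuntimeError):
    """Raised when the two redundant computations disagree."""


def run_in_lockstep(op_a, op_b, *args):
    """Run the same operation on two redundant 'cores' and compare results.

    op_a and op_b stand in for the two CPUs. In a healthy system they are
    the same function and always agree.
    """
    result_a = op_a(*args)
    result_b = op_b(*args)
    if result_a != result_b:
        # A mismatch means one of the two paths has failed; flag the error
        # rather than pass a possibly corrupt value downstream.
        raise LockstepError(f"mismatch: {result_a!r} != {result_b!r}")
    return result_a


def scale(x):
    """Example operation executed by both cores."""
    return 3 * x + 1


def faulty_scale(x):
    """Same operation with a simulated single-bit fault in the result."""
    return scale(x) ^ 0x4


if __name__ == "__main__":
    print(run_in_lockstep(scale, scale, 5))  # both cores agree
    try:
        run_in_lockstep(scale, faulty_scale, 5)
    except LockstepError as err:
        print("fault detected:", err)
```

The comparison detects that something went wrong but cannot tell which core is at fault, which is why a mismatch is treated as an error to be handled rather than corrected.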

For applications with specialized requirements (for example, computationally intensive operations or large amounts of signal processing), a heterogeneous MCU may provide a better solution. A heterogeneous MCU incorporates different types of processors. It might feature a general-purpose CPU along with a digital signal processor (DSP) and/or a processor dedicated to floating-point arithmetic.

TI’s Concerto heterogeneous multicore MCUs, for example, combine a C28x 32-bit CPU and an ARM Cortex-M3 32-bit CPU to optimize the subsystems (Figure 1). The C28x manages the control subsystems, providing floating-point operation at up to 150 MHz. Meanwhile, the ARM Cortex-M3 handles communications, logic, and sequencing/monitoring, with speeds of as much as 100 MHz. The MCU incorporates error detection on both flash memory and RAM, as well as built-in clock monitoring with multiple system watchdogs.

Figure 1: Heterogeneous multicore MCUs such as the Concerto incorporate different cores that provide an optimal solution for each task.

A true multicore MCU requires more than just multiple cores with shared memory. To permit effective parallel processing, the architecture has to ensure that each CPU is operating on the most up-to-date data. In a dual-core MCU, each CPU has a dedicated level 1 (L1) cache, but both CPUs share a level 2 (L2) cache (Figure 2). The challenge is ensuring that if CPU1 updates a variable in its L1 cache, CPU2 winds up with the correct information, and not the old data that its L1 cache previously acquired from the L2 cache. Designs typically accomplish this using a hardware monitor known by various names, including coherency module or, simply, snooper. When CPU1 saves a variable to its L1 cache, for example, the coherency module registers the change and invalidates the corresponding data in CPU2’s L1 cache. When CPU2 next tries to access that location, its L1 cache no longer holds a valid copy, so it must go to the L2 cache for the new data.

Figure 2: The QorIQ family from Freescale features a coherency module that monitors changes to the level 1 (L1) caches of each CPU to ensure that each core operates on the most up-to-date data.
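
The write-invalidate behavior described above can be sketched as a toy software model. This is a simplification under stated assumptions, not a description of the QorIQ hardware: the classes `SharedL2`, `CoherencyModule`, and `L1Cache` are hypothetical, and the model writes every update straight through to the L2 cache, which real designs may not do.

```python
class SharedL2:
    """Level 2 cache shared by all CPUs, modeled as address -> value."""

    def __init__(self):
        self.data = {}


class CoherencyModule:
    """Snooper: watches writes and invalidates stale copies in other L1s."""

    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def invalidate_others(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # drop the stale copy, if any


class L1Cache:
    """Private level 1 cache belonging to one CPU."""

    def __init__(self, l2, snooper):
        self.l2 = l2
        self.snooper = snooper
        self.lines = {}
        snooper.attach(self)

    def read(self, addr):
        if addr not in self.lines:
            # Miss (or invalidated line): fetch the current value from L2.
            self.lines[addr] = self.l2.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.l2.data[addr] = value  # write-through, assumed for simplicity
        self.snooper.invalidate_others(self, addr)


if __name__ == "__main__":
    l2, snooper = SharedL2(), CoherencyModule()
    cpu1, cpu2 = L1Cache(l2, snooper), L1Cache(l2, snooper)
    l2.data[0x10] = 1
    print(cpu2.read(0x10))  # CPU2 caches the old value
    cpu1.write(0x10, 2)     # snooper invalidates CPU2's copy
    print(cpu2.read(0x10))  # CPU2 misses and refetches from L2
```

Without the call to `invalidate_others`, CPU2's second read would return the stale value still held in its own L1 cache.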

The QorIQ P2 platform series communications processors from Freescale Semiconductor are based on dual e500 Power Architecture cores and feature dual 32-Kbyte L1 caches for each CPU, plus a 256-Kbyte L2 cache. Users have the option of partitioning the L2 cache between the two cores or configuring it as stashing memory or SRAM. A P2020 evaluation board, designated P2020COME-DS-PB-ND, allows engineers to become familiar with the intricacies of Freescale’s dual-core MCUs.

Functional parallelism versus data parallelism

Hardware is just the start. The big benefit of multicore MCUs is the way programmers can leverage software to maximize performance. There are multiple ways of programming multicore MCUs. In symmetric multiprocessing (SMP), perhaps the most common approach, all CPUs have access to the common memory space and are run by a single operating system. CPUs communicate via variables in the shared memory. Any of the CPUs can run any process, although a given process typically runs on only one CPU at any given time.

Asymmetric multiprocessing (AMP) provides more degrees of freedom. In AMP, specific CPUs can be assigned to certain processes to achieve optimal performance. Asymmetric architectures can even run different operating systems on different processors, running a real-time operating system (RTOS) on the core handling time-sensitive operations while the general-purpose core runs Linux.

One of the primary benefits of multicore MCUs is the ability to process in parallel. Parallel processing can be divided into functional parallel processing and data parallel processing. Functional parallelism involves breaking up the task into individual functions, with different processors performing different functions. It is a powerful technique, but is not where the real muscle lies in the multicore approach.

Data parallelism provides the biggest performance boost. It involves dividing the data into individual pieces that are processed by the different cores. It is a powerful technique, but because the CPUs communicate through the shared memory, synchronization is essential to ensure that CPUs are conducting their operations in the proper order and with the correct data.

Multithreading, or fork/join parallelism, provides one method for ensuring synchronization. The system divides the processing into threads, splitting up the data among the CPUs, each of which runs the same code on its piece of data. When the threads complete their operations, their partial results are combined to produce the final result. Because the operation does not conclude until every thread has completed, the join step acts as a synchronization point.
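
A minimal sketch of the fork/join pattern follows, using the thread pool from Python's standard library. The helper names `split`, `sum_of_squares`, and `fork_join` are hypothetical, and the workload is a placeholder. Note that in standard CPython, threads doing pure computation do not run truly in parallel, so this shows the structure of the technique rather than the speedup a multicore MCU would deliver.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The same code every worker runs on its own piece of the data."""
    return sum(x * x for x in chunk)


def fork_join(data, workers=2):
    """Fork one task per chunk, then join and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: each chunk is handed to a worker thread.
        # Join: list() does not return until every worker has finished.
        partials = list(pool.map(sum_of_squares, split(data, workers)))
    return sum(partials)


if __name__ == "__main__":
    print(fork_join(list(range(10)), workers=2))
```

The combined result is computed only after every partial result is available, which is what makes the join a natural synchronization point.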

Using a first in/first out (FIFO) buffer provides another method for synchronization. When the CPUs communicate through the FIFO, the producing CPU can write to it only if it has space, and the consuming CPU can read from it only if it holds data. The buffer enforces ordering: if it is full, the writing CPU cannot add to it and has to wait its turn.
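
The following sketch models FIFO synchronization with a bounded queue from Python's standard library, with two threads standing in for two CPUs. The function names, the buffer depth of 2, and the doubling operation are illustrative assumptions only.

```python
import queue
import threading


def producer(fifo, items):
    """First stage: writes items into the FIFO, waiting whenever it is full."""
    for item in items:
        fifo.put(item)  # blocks while the buffer is full
    fifo.put(None)      # sentinel telling the consumer there is no more data


def consumer(fifo, results):
    """Second stage: reads items in arrival order, waiting whenever it is empty."""
    while True:
        item = fifo.get()  # blocks while the buffer is empty
        if item is None:
            break
        results.append(item * 2)


def run_pipeline(items, depth=2):
    """Connect producer and consumer through a FIFO holding `depth` entries."""
    fifo = queue.Queue(maxsize=depth)
    results = []
    stages = [
        threading.Thread(target=producer, args=(fifo, items)),
        threading.Thread(target=consumer, args=(fifo, results)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4, 5]))
```

Because the buffer is small, the producer can never run far ahead of the consumer, and the consumer always sees the data in the order it was written.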

Mutual exclusion (mutex) locks offer a more sophisticated approach to synchronization. Typically built on hardware support such as atomic instructions or hardware semaphores, the mutex lock ensures that only one CPU has ownership of a shared variable at any given time. When a given thread begins its operation, it acquires the lock before accessing the variable, which blocks other threads from accessing the information. When the operation concludes, the thread releases the lock so that the variable can be accessed by another thread.
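
As a rough illustration, the sketch below protects a shared counter with a software mutex from Python's standard library. `SharedCounter` and `count_with_threads` are hypothetical names, and on an MCU the lock would typically rest on atomic instructions or a hardware semaphore rather than an operating-system primitive.

```python
import threading


class SharedCounter:
    """A shared variable that only one thread may update at a time."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Acquire the lock, perform the read-modify-write, then release.
        # The `with` block releases the lock even if an exception occurs.
        with self._lock:
            current = self.value
            self.value = current + 1


def count_with_threads(num_threads=4, increments=1000):
    """Have several threads increment the same counter concurrently."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads(4, 1000))
```

With the lock in place, no increment can be lost, so the final count always equals the number of threads multiplied by the increments per thread.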

Particularly with parallel processing, multicore MCUs provide powerful solutions for embedded design. They must be designed and programmed with care, however. The more threads, the more challenging the process becomes. Bugs can cause systems to deadlock, get stuck in loops, or even produce inconsistent results that depend on which thread happens to finish first (a race condition).
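
To make the order-dependent bug concrete, the sketch below replays two fixed interleavings of the same two unsynchronized increments. It is a deterministic simulation rather than real concurrent code, and `run_schedule`, `SERIAL`, and `INTERLEAVED` are hypothetical names chosen for illustration.

```python
def run_schedule(schedule):
    """Replay (thread_id, step) pairs against one shared variable.

    Each simulated thread increments the shared value in two steps:
    'read' copies the shared value into a private register, and
    'write' stores that register plus one back to the shared value.
    """
    shared = 0
    registers = {}
    for thread_id, step in schedule:
        if step == "read":
            registers[thread_id] = shared
        elif step == "write":
            shared = registers[thread_id] + 1
        else:
            raise ValueError(f"unknown step: {step}")
    return shared


# Thread A finishes its increment before thread B starts.
SERIAL = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Both threads read before either writes, so one increment is lost.
INTERLEAVED = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]


if __name__ == "__main__":
    print(run_schedule(SERIAL))       # 2
    print(run_schedule(INTERLEAVED))  # 1
```

The same code yields 2 or 1 depending purely on timing, which is why such bugs can pass many test runs before surfacing in the field.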

Hardware design has its own challenges. Although multicore MCUs are powerful, they are not automatically the best solution for a given application. Users need to consider processor capabilities and bandwidth limitations: as fast as the cores might be, they will all be sharing the same communications bus, ADCs, and other resources.

Overall, multicore MCUs provide useful options for a wide range of embedded design challenges. Tools exist to simplify embedded design and programming. By using them and paying careful attention to test and validation, design teams can get high performance at an economical price point and speed time to market for their products.
