Add High-Performance Speech Keyword Spotting to IoT Designs: Part 2 – Using MCUs

By Stephen Evanczuk

Contributed By Digi-Key's North American Editors

Editor’s Note: Using an emerging class of efficient algorithms, any developer can now deploy sophisticated keyword spotting features on low-power, resource-constrained systems. Part one of this two-part series showed how to do it with FPGAs. Here in Part two we will show how to do it with MCUs.

Keyword spotting (KWS) technology has emerged as an increasingly important feature for wearables, IoT devices, and other smart products. Machine learning methods can provide exceptional KWS accuracy, but the power and performance limitations of these products have until recently limited use of machine learning KWS solutions to the largest enterprises or highly experienced machine learning experts.

However, developers increasingly need to implement voice-activated features in wearables and other IoT devices, which calls for more efficient KWS engines able to operate within the resource constraints of these devices. Depthwise separable convolutional neural network (DS-CNN) architectures modify conventional CNNs to provide the needed efficiency.

Using a hardware-optimized neural network library, developers can implement MCU-based DS-CNN inference engines that require minimal resources. This article describes the DS-CNN architecture and shows how developers can implement DS-CNN KWS on MCUs.

Why KWS?

Consumer acceptance of voice activated features on smartphones and home appliances using phrases such as "Alexa," "Hey Siri," or "Ok Google" has rapidly evolved to broad demand for voice services on nearly any product designed for user interaction. Underlying these services, accurate speech recognition relies upon a variety of artificial intelligence methods to identify spoken words, and interpret words and word phrases as commands appropriate to the application.

However, the resources required to quickly and accurately complete this entire voice command sequence start to exceed the capabilities of low-cost line-powered consumer hubs, much less battery-operated personal electronics.

The voice command pipeline requirements

To deliver voice activation on these products, developers split the voice command pipeline or limit voice commands to a few very simple words such as "on" and "off." On a resource-limited consumer product, developers implement KWS capabilities using neural network inference engines able to deliver the required high accuracy, low latency response to simple commands or command activation phrases for Alexa, Siri, or Google (Figure 1).

Here, the design digitizes the input audio stream, extracts speech features, and passes those features along to a neural network for identification of the keyword.

Figure 1: Voice activation with KWS uses a processing pipeline that extracts frequency domain features from an audio input signal and classifies the extracted features to predict the probability that the input signal corresponds to one of the labels used to train the neural network. (Image source: Arm®)

By converting the amplitude modulated audio input stream to features in a frequency spectrogram, developers can take advantage of the proven ability of convolutional neural network (CNN) models to accurately classify the spoken word according to one of the labels used during neural network training [2]. For more complex voice interfaces, the command processing pipeline extends beyond the device itself. After the KWS inference engine detects the activation keyword or phrase, the product passes the digitized audio stream to cloud-based services able to more effectively handle complex speech processing and command recognition operations.

Still, the conflict between device resource availability and inference engine resource requirements has confounded developers' attempts to apply these methods to even smaller designs for wearables and IoT devices. Although the classic CNN is relatively well understood and straightforward to train, these models can still be resource intensive. As the accuracy of CNN models in recognizing images has increased dramatically, CNN size and complexity have also increased significantly.

The result is very accurate CNN models that require billions of compute-intensive general matrix multiply (GEMM) operations for training. Once trained, the corresponding inference models can occupy hundreds of megabytes of memory and require a very large number of GEMM operations for a single inference.

For battery-operated wearables and IoT devices, an effective KWS inference model must be able to run in limited memory with low processing requirements. In addition, because a KWS inference engine must operate in "always on" mode to perform its function, it must be able to operate with minimal power consumption.

This dichotomy between the potential of neural networks and the limited resources in the increasingly attractive arena of wearables and IoT devices has attracted significant attention from machine learning experts. The result has been development of techniques for optimizing the basic CNN model and the appearance of alternative neural network architectures able to bridge the gap between performance requirements and resource capabilities of small resource-constrained devices.

Small footprint models

Among the techniques for creating small footprint models, machine learning experts have applied optimization methods such as network pruning and parameter quantization to produce CNN variants able to deliver results nearly as accurate as full CNNs, but using a fraction of the resources. The success of these reduced precision neural networks paved the way for binarized neural network (BNN) architectures that reduce model parameters from the 32-bit floating-point, or even 16- and 8-bit found in earlier CNNs, down to 1-bit values. As described in Part 1, the Lattice Semiconductor machine learning SensAI™ platform uses this highly efficient BNN architecture as the basis for a 1 milliwatt (mW) KWS solution running on its iCE40 UltraPlus FPGA-based mobile development platform, or MDP.
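As a rough illustration of parameter quantization, the sketch below maps a floating-point weight to an 8-bit fixed-point value of the kind stored in a q7_t. The helper names quantize_q7 and dequantize_q7 and the frac_bits parameter are hypothetical and are not taken from any library discussed here; real toolchains typically choose the number of fractional bits per layer from the observed weight range.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: convert a float to an 8-bit fixed-point value with
// `frac_bits` fractional bits, rounding to nearest and saturating at the
// limits of the signed 8-bit range.
std::int8_t quantize_q7(float value, int frac_bits) {
    assert(frac_bits >= 0 && frac_bits <= 7);
    const float scaled = std::round(value * static_cast<float>(1 << frac_bits));
    if (scaled > 127.0f) return 127;
    if (scaled < -128.0f) return -128;
    return static_cast<std::int8_t>(scaled);
}

// Inverse mapping, useful only for inspecting the quantization error.
float dequantize_q7(std::int8_t q, int frac_bits) {
    assert(frac_bits >= 0 && frac_bits <= 7);
    return static_cast<float>(q) / static_cast<float>(1 << frac_bits);
}
```

With seven fractional bits, a weight of 0.5 is stored as 64, while a weight of 1.0 cannot be represented and saturates to 127. This small loss of range and precision is the price paid for a fourfold reduction in storage compared to 32-bit floating point.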

Along with reductive techniques such as network pruning and parameter quantization, there are other approaches to lowering resource requirements that modify the topology of the CNN architecture itself. Among these alternative architectures, the depthwise separable convolutional neural network offers a particularly effective approach for creating small, resource efficient models able to run on general purpose MCUs.

Building on earlier work, Google machine learning experts found a way to increase the efficiency of CNNs by focusing on the convolution layer itself. In a conventional CNN, each convolution layer filters input features and combines them into a new set of features in a single step (Figure 2, top).

Figure 2: Unlike a full convolution (top), depthwise separable convolution first uses a DKxDK filter (middle) to separately filter each of the M input channels and uses a pointwise 1 x 1 convolution to create N new features. (Image source: Google)

The new approach breaks filtering and feature generation into two separate stages, together called a depthwise separable convolution. The first stage performs a depthwise convolution which acts as a spatial filter on each channel of an input (Figure 2, middle). Because this first stage does not create new features (the core objective of a deep neural network architecture), the second stage performs a pointwise 1 x 1 convolution (Figure 2, bottom) that combines the outputs of the first stage to generate new features.
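The two stages can be sketched in plain floating-point C++ as follows. This is a minimal illustration that assumes stride 1, no padding, and a height x width x channel (HWC) memory layout; the FeatureMap type and both function names are hypothetical and do not correspond to any library API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal feature map in height x width x channel (HWC) order.
struct FeatureMap {
    int h, w, c;
    std::vector<float> data;
    FeatureMap(int h_, int w_, int c_)
        : h(h_), w(w_), c(c_), data(static_cast<std::size_t>(h_) * w_ * c_, 0.0f) {}
    float& at(int y, int x, int ch) {
        return data[(static_cast<std::size_t>(y) * w + x) * c + ch];
    }
    float at(int y, int x, int ch) const {
        return data[(static_cast<std::size_t>(y) * w + x) * c + ch];
    }
};

// Stage 1: depthwise convolution. Each input channel is filtered by its own
// k x k kernel; channels are never mixed, so no new features are created.
// `filters` is laid out as [ky][kx][channel].
FeatureMap depthwise_conv(const FeatureMap& in, const std::vector<float>& filters, int k) {
    assert(k > 0 && in.h >= k && in.w >= k);
    assert(filters.size() == static_cast<std::size_t>(k) * k * in.c);
    FeatureMap out(in.h - k + 1, in.w - k + 1, in.c);
    for (int y = 0; y < out.h; ++y)
        for (int x = 0; x < out.w; ++x)
            for (int ch = 0; ch < in.c; ++ch) {
                float acc = 0.0f;
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx)
                        acc += in.at(y + ky, x + kx, ch) *
                               filters[(static_cast<std::size_t>(ky) * k + kx) * in.c + ch];
                out.at(y, x, ch) = acc;
            }
    return out;
}

// Stage 2: pointwise 1 x 1 convolution. At every position, the M input
// channels are combined into n_out new features.
// `weights` is laid out as [output feature][input channel].
FeatureMap pointwise_conv(const FeatureMap& in, const std::vector<float>& weights, int n_out) {
    assert(n_out > 0);
    assert(weights.size() == static_cast<std::size_t>(n_out) * in.c);
    FeatureMap out(in.h, in.w, n_out);
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x)
            for (int n = 0; n < n_out; ++n) {
                float acc = 0.0f;
                for (int m = 0; m < in.c; ++m)
                    acc += in.at(y, x, m) * weights[static_cast<std::size_t>(n) * in.c + m];
                out.at(y, x, n) = acc;
            }
    return out;
}
```

Calling pointwise_conv on the result of depthwise_conv yields the same output shape as a full convolution with n_out filters, but the spatial filtering work is done only once per input channel rather than once per input and output channel pair.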

Used in Google's MobileNet models for mobile and embedded vision applications, this DS-CNN architecture reduces the number of parameters and associated operations, resulting in smaller models that require significantly fewer computations to achieve accurate results [3].

Compared to full convolutions, the use of depthwise separable convolutions in MobileNet models reduces accuracy by only 1% on the industry standard ImageNet data set, but uses less than 12% of the multiply-add operations and 14% of the model parameters required for full convolutions in conventional ImageNet CNN models.
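The source of these savings can be seen from a per-layer count of multiply-add operations. The expressions below use the filter size, input channel count, and output feature count from Figure 2, written here as D_K, M, and N; the symbol D_F for the width and height of a square output feature map is an assumption introduced only for this sketch.

```latex
\begin{align*}
\text{full convolution:} \quad
  & D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \\
\text{depthwise separable:} \quad
  & \underbrace{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F}_{\text{depthwise}}
    + \underbrace{M \cdot N \cdot D_F \cdot D_F}_{\text{pointwise}} \\
\text{ratio:} \quad
  & \frac{D_K^2 \, M \, D_F^2 + M \, N \, D_F^2}{D_K^2 \, M \, N \, D_F^2}
    = \frac{1}{N} + \frac{1}{D_K^2}
\end{align*}
```

As an illustrative case, a layer with 3 x 3 filters and N = 64 output features gives 1/64 + 1/9, or about 0.127, so that layer needs roughly 13% of the multiply-adds of the equivalent full convolution. This is a single-layer estimate and is not meant to reproduce the whole-network figures quoted above.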

Although DS-CNNs were originally developed for image recognition, these same models can serve audio recognition simply by transforming an audio input stream into a frequency spectrogram to provide a set of usable features. In effect, an audio front end converts the audio stream to a set of features that the DS-CNN can classify. For speech processing, the features produced by the front end typically take the form of Mel-frequency cepstral coefficients (MFCC), which more closely match human auditory characteristics while significantly reducing the dimensionality of the feature set passed to the DS-CNN classifier. This is precisely the approach used in the ARM ML-KWS-for-MCU open-source software repository.
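One stage of a typical MFCC front end is a bank of filters spaced evenly on the mel scale rather than in hertz. The sketch below uses a widely used mel conversion formula to place the filter center frequencies; it is an assumption for illustration and not necessarily the exact formulation used in the ARM repository. A complete MFCC computation would also involve windowing, a fast Fourier transform, a logarithm, and a discrete cosine transform, none of which are shown. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Commonly used conversion between frequency in hertz and the mel scale.
double hz_to_mel(double hz) {
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

double mel_to_hz(double mel) {
    return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}

// Center frequencies (in Hz) of `num_filters` filters spaced evenly on the
// mel scale between low_hz and high_hz. The band edges themselves are not
// used as centers.
std::vector<double> mel_center_frequencies(int num_filters, double low_hz, double high_hz) {
    assert(num_filters > 0 && low_hz >= 0.0 && high_hz > low_hz);
    const double low_mel = hz_to_mel(low_hz);
    const double step = (hz_to_mel(high_hz) - low_mel) / (num_filters + 1);
    std::vector<double> centers(num_filters);
    for (int i = 0; i < num_filters; ++i) {
        centers[i] = mel_to_hz(low_mel + step * (i + 1));
    }
    return centers;
}
```

Because the spacing is uniform in mel, the filters sit close together at low frequencies and spread apart at high frequencies, which is how the front end approximates human auditory resolution while reducing a full spectrum to a few dozen values per frame.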

DS-CNN implementation

Designed to demonstrate KWS implementation on Arm Cortex®-M-series MCUs, the ARM KWS repository provides an extensive set of pre-trained TensorFlow models in multiple architectures including conventional CNNs, DS-CNNs, and others. Trained with the Google speech commands dataset [4], the models classify audio input as one of 12 possible classes: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "silence" (no word spoken), and "unknown" (representing the other words contained in the Google dataset).

Developers can immediately use these pre-trained models to compare inference performance of these alternative neural network architectures and examine their internal structure. For example, after running TensorFlow's import_pb_to_tensorboard Python utility on ARM's pre-trained DS-CNN model, the developer can use TensorBoard to visualize the model's MobileNet-based architecture.

Figure 3: Displayed in TensorBoard, the Arm pre-trained KWS model combines a familiar MobileNet DS-CNN model (red outline, left) with a frequency domain feature extraction stage (expanded, on right) using Mel-frequency cepstral coefficients (MFCC). (Image source: Digi-Key Electronics)

As visualized in TensorBoard, the MobileNet architecture replaces all but the first full convolution layer in the conventional CNN architecture with depthwise separable convolutions.

As noted earlier, each of these stages includes depthwise convolution and pointwise convolution stages, each feeding into a batchnorm kernel to normalize the output results (Figure 3, left). The DS-CNN model uses a special TensorFlow fused batchnorm function, which combines several operations into a single kernel.

In addition, by zooming into the audio input feature extraction stage (Figure 3, right), developers can examine the audio processing sequence including audio decode, spectrogram generation, and MFCC filtering. The features generated by the MFCC pass through a pair of reshape stages to create the tensor shapes required by the MobileNet classifier.

Developers can conceivably run trained models from TensorFlow or other machine learning frameworks on small processor-based systems such as the Raspberry Pi [5]. With this approach, developers can quantize the trained models to produce smaller versions able to run on these systems. However, without a graphics processing unit (GPU) or other hardware support for GEMM acceleration, inference latency would likely disappoint user expectations for voice activation performance.

ARM provides an alternative approach through its neural network (NN) extension to the ARM Cortex Microcontroller Software Interface Standard (CMSIS). CMSIS-NN provides a complete set of CNN functions that take full advantage of the DSP extensions built into ARM Cortex-M7 processors such as those in STMicroelectronics’ STM32F7 MCU family. Along with conventional CNN functions, the CMSIS-NN application programming interface (API) supports depthwise separable convolutions with a pair of functions corresponding to the depthwise and pointwise 1 x 1 convolution stages underlying DS-CNN architectures:

arm_status arm_depthwise_separable_conv_HWC_q7_nonsquare

arm_status arm_convolve_1x1_HWC_q7_fast_nonsquare

The API also provides the two functions in versions designed specifically for square input tensors.

ARM uses these functions in sample code that demonstrates a complete DS-CNN-based KWS application running on the STMicroelectronics STM32F746G-DISCO development board built around the STM32F746NGH6 MCU. At the heart of the sample code, a native CMSIS-NN C++ module implements a DS-CNN (Listing 1).

void DS_CNN::run_nn(q7_t* in_data, q7_t* out_data)
{
  //CONV1 : regular convolution
  arm_convolve_HWC_q7_basic_nonsquare(in_data, CONV1_IN_X, CONV1_IN_Y, 1, conv1_wt, CONV1_OUT_CH, CONV1_KX, CONV1_KY, CONV1_PX, CONV1_PY, CONV1_SX, CONV1_SY, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT, buffer1, CONV1_OUT_X, CONV1_OUT_Y, (q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer1,CONV1_OUT_X*CONV1_OUT_Y*CONV1_OUT_CH);

  //CONV2 : DS + PW conv
  //Depthwise separable conv (batch norm params folded into conv wts/bias)
  arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV2_IN_X,CONV2_IN_Y,CONV1_OUT_CH,conv2_ds_wt,CONV1_OUT_CH,CONV2_DS_KX,CONV2_DS_KY,CONV2_DS_PX,CONV2_DS_PY,CONV2_DS_SX,CONV2_DS_SY,conv2_ds_bias,CONV2_DS_BIAS_LSHIFT,CONV2_DS_OUT_RSHIFT,buffer2,CONV2_OUT_X,CONV2_OUT_Y,(q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer2,CONV2_OUT_X*CONV2_OUT_Y*CONV2_OUT_CH);
  //Pointwise conv
  arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV2_OUT_X, CONV2_OUT_Y, CONV1_OUT_CH, conv2_pw_wt, CONV2_OUT_CH, 1, 1, 0, 0, 1, 1, conv2_pw_bias, CONV2_PW_BIAS_LSHIFT, CONV2_PW_OUT_RSHIFT, buffer1, CONV2_OUT_X, CONV2_OUT_Y, (q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer1,CONV2_OUT_X*CONV2_OUT_Y*CONV2_OUT_CH);

  //CONV3 : DS + PW conv
  //Depthwise separable conv (batch norm params folded into conv wts/bias)
  arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV3_IN_X,CONV3_IN_Y,CONV2_OUT_CH,conv3_ds_wt,CONV2_OUT_CH,CONV3_DS_KX,CONV3_DS_KY,CONV3_DS_PX,CONV3_DS_PY,CONV3_DS_SX,CONV3_DS_SY,conv3_ds_bias,CONV3_DS_BIAS_LSHIFT,CONV3_DS_OUT_RSHIFT,buffer2,CONV3_OUT_X,CONV3_OUT_Y,(q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer2,CONV3_OUT_X*CONV3_OUT_Y*CONV3_OUT_CH);
  //Pointwise conv
  arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV3_OUT_X, CONV3_OUT_Y, CONV2_OUT_CH, conv3_pw_wt, CONV3_OUT_CH, 1, 1, 0, 0, 1, 1, conv3_pw_bias, CONV3_PW_BIAS_LSHIFT, CONV3_PW_OUT_RSHIFT, buffer1, CONV3_OUT_X, CONV3_OUT_Y, (q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer1,CONV3_OUT_X*CONV3_OUT_Y*CONV3_OUT_CH);

  //CONV4 : DS + PW conv
  //Depthwise separable conv (batch norm params folded into conv wts/bias)
  arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV4_IN_X,CONV4_IN_Y,CONV3_OUT_CH,conv4_ds_wt,CONV3_OUT_CH,CONV4_DS_KX,CONV4_DS_KY,CONV4_DS_PX,CONV4_DS_PY,CONV4_DS_SX,CONV4_DS_SY,conv4_ds_bias,CONV4_DS_BIAS_LSHIFT,CONV4_DS_OUT_RSHIFT,buffer2,CONV4_OUT_X,CONV4_OUT_Y,(q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer2,CONV4_OUT_X*CONV4_OUT_Y*CONV4_OUT_CH);
  //Pointwise conv
  arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV4_OUT_X, CONV4_OUT_Y, CONV3_OUT_CH, conv4_pw_wt, CONV4_OUT_CH, 1, 1, 0, 0, 1, 1, conv4_pw_bias, CONV4_PW_BIAS_LSHIFT, CONV4_PW_OUT_RSHIFT, buffer1, CONV4_OUT_X, CONV4_OUT_Y, (q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer1,CONV4_OUT_X*CONV4_OUT_Y*CONV4_OUT_CH);

  //CONV5 : DS + PW conv
  //Depthwise separable conv (batch norm params folded into conv wts/bias)
  arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV5_IN_X,CONV5_IN_Y,CONV4_OUT_CH,conv5_ds_wt,CONV4_OUT_CH,CONV5_DS_KX,CONV5_DS_KY,CONV5_DS_PX,CONV5_DS_PY,CONV5_DS_SX,CONV5_DS_SY,conv5_ds_bias,CONV5_DS_BIAS_LSHIFT,CONV5_DS_OUT_RSHIFT,buffer2,CONV5_OUT_X,CONV5_OUT_Y,(q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer2,CONV5_OUT_X*CONV5_OUT_Y*CONV5_OUT_CH);
  //Pointwise conv
  arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV5_OUT_X, CONV5_OUT_Y, CONV4_OUT_CH, conv5_pw_wt, CONV5_OUT_CH, 1, 1, 0, 0, 1, 1, conv5_pw_bias, CONV5_PW_BIAS_LSHIFT, CONV5_PW_OUT_RSHIFT, buffer1, CONV5_OUT_X, CONV5_OUT_Y, (q15_t*)col_buffer, NULL);
  arm_relu_q7(buffer1,CONV5_OUT_X*CONV5_OUT_Y*CONV5_OUT_CH);

  //Average pool
  arm_avepool_q7_HWC_nonsquare (buffer1,CONV5_OUT_X,CONV5_OUT_Y,CONV5_OUT_CH,CONV5_OUT_X,CONV5_OUT_Y,0,0,1,1,1,1,NULL,buffer2, 2);

  //Fully connected output layer
  arm_fully_connected_q7(buffer2, final_fc_wt, CONV5_OUT_CH, OUT_DIM, FINAL_FC_BIAS_LSHIFT, FINAL_FC_OUT_RSHIFT, final_fc_bias, out_data, (q15_t*)col_buffer);
}

Listing 1: The ARM ML-KWS-for-MCU software repository includes a C++ DS-CNN model, where a full convolution layer is followed by several depthwise separable convolutions, each implemented with the depthwise convolution and 1 x 1 convolution functions supported in the hardware-optimized ARM CMSIS-NN software library. (Code source: ARM)

Although the C++ DS-CNN implementation differs slightly from the TensorBoard DS-CNN model shown earlier, the overall approach remains the same. Following an initial full convolution kernel, a series of depthwise separable convolution kernels feed into final pooling and fully connected layers to generate the prediction values for each output channel (corresponding to the 12 class labels used to train the model).
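The final pooling and fully connected stages can be sketched in floating point as follows. This is a simplified illustration under stated assumptions, not the fixed-point CMSIS-NN implementation: the function names are hypothetical, the feature map is assumed to be stored position by position with channels interleaved, and the pooling window is assumed to cover the entire feature map.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average each channel over every spatial position of an HWC feature map,
// leaving one value per channel.
std::vector<float> global_average_pool(const std::vector<float>& hwc, int positions, int channels) {
    assert(positions > 0 && channels > 0);
    assert(hwc.size() == static_cast<std::size_t>(positions) * channels);
    std::vector<float> pooled(channels, 0.0f);
    for (int p = 0; p < positions; ++p)
        for (int c = 0; c < channels; ++c)
            pooled[c] += hwc[static_cast<std::size_t>(p) * channels + c];
    for (int c = 0; c < channels; ++c) pooled[c] /= static_cast<float>(positions);
    return pooled;
}

// Fully connected layer producing one score per class.
// `weights` is laid out as [class][channel]; `bias` has one entry per class.
std::vector<float> fully_connected(const std::vector<float>& in,
                                   const std::vector<float>& weights,
                                   const std::vector<float>& bias) {
    const std::size_t n_out = bias.size();
    assert(weights.size() == n_out * in.size());
    std::vector<float> out(n_out, 0.0f);
    for (std::size_t k = 0; k < n_out; ++k) {
        float acc = bias[k];
        for (std::size_t c = 0; c < in.size(); ++c) acc += weights[k * in.size() + c] * in[c];
        out[k] = acc;
    }
    return out;
}
```

In a KWS model of this kind, the bias vector would have 12 entries, one for each class label, and the resulting scores would then be normalized into per-class prediction values.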

The KWS application combines this model with code to provide inference of real-time audio streams collected by the STM32F746G-DISCO development board. Here, the main function initializes the inference engine, enables audio sampling, and then enters an endless loop consisting of a single wait-for-interrupt (WFI) call (Listing 2).

// Excerpt: the globals kws, lcd, lcd_output_string, recording_win, and
// averaging_window_len are declared elsewhere in the repository's main.cpp.
char output_class[12][8] = {"Silence", "Unknown","yes","no","up","down","left","right","on","off","stop","go"};

void run_kws();

int main()
{
  kws = new KWS_F746NG(recording_win,averaging_window_len);
  kws->start_kws();   //enable audio sampling
  while (1) {
  /* A dummy loop to wait for the interrupts. Feature extraction and
     neural network inference are done in the interrupt service routine. */
    __WFI();
  }
}

/*
 * The audio recording works with two ping-pong buffers.
 * The data for each window will be transferred by the DMA, which
 * sends an interrupt after the transfer is completed.
 */

// Manages the DMA Transfer complete interrupt.
void BSP_AUDIO_IN_TransferComplete_CallBack(void)
{
  arm_copy_q7((q7_t *)kws->audio_buffer_in + kws->audio_block_size*4, (q7_t *)kws->audio_buffer_out + kws->audio_block_size*4, kws->audio_block_size*4);
  if(kws->frame_len != kws->frame_shift) {
    //copy the last (frame_len - frame_shift) audio data to the start
    arm_copy_q7((q7_t *)(kws->audio_buffer)+2*(kws->audio_buffer_size-(kws->frame_len-kws->frame_shift)), (q7_t *)kws->audio_buffer, 2*(kws->frame_len-kws->frame_shift));
  }
  // copy the new recording data (second half of the ping-pong buffer)
  for (int i=0;i<kws->audio_block_size;i++) {
    kws->audio_buffer[kws->frame_len-kws->frame_shift+i] = kws->audio_buffer_in[2*kws->audio_block_size+i*2];
  }
  run_kws();
}

// Manages the DMA Half Transfer complete interrupt.
void BSP_AUDIO_IN_HalfTransfer_CallBack(void)
{
  arm_copy_q7((q7_t *)kws->audio_buffer_in, (q7_t *)kws->audio_buffer_out, kws->audio_block_size*4);
  if(kws->frame_len!=kws->frame_shift) {
    //copy the last (frame_len - frame_shift) audio data to the start
    arm_copy_q7((q7_t *)(kws->audio_buffer)+2*(kws->audio_buffer_size-(kws->frame_len-kws->frame_shift)), (q7_t *)kws->audio_buffer, 2*(kws->frame_len-kws->frame_shift));
  }
  // copy the new recording data (first half of the ping-pong buffer)
  for (int i=0;i<kws->audio_block_size;i++) {
    kws->audio_buffer[kws->frame_len-kws->frame_shift+i] = kws->audio_buffer_in[i*2];
  }
  run_kws();
}

void run_kws()
{
  kws->extract_features();    //extract mfcc features
  kws->classify();            //classify using dnn
  kws->average_predictions(); //smooth the results over the averaging window
  int max_ind = kws->get_top_class(kws->averaged_output);
  //q7 outputs are scaled by 128, so convert to a percentage for display
  sprintf(lcd_output_string,"%d%% %s",((int)kws->averaged_output[max_ind]*100/128),output_class[max_ind]);
  lcd.DisplayStringAt(0, LINE(8), (uint8_t *) lcd_output_string, CENTER_MODE);
}

Listing 2: In the ARM ML-KWS-for-MCU software repository, the main routine for the DS-CNN KWS application instantiates the inference engine (through KWS_F746NG), activates the STM32F746G-DISCO development board's audio subsystem, and enters an endless loop, waiting for interrupts to call completion routines that perform inference (run_kws()). (Code source: ARM)

Alongside this main routine, callback functions serve as completion routines that buffer the recorded data and begin the inference process with a call to run_kws(). The run_kws() function calls methods on the inference engine instance to extract features, classify the result, and provide predictions that indicate the probability that the recorded audio sample belongs to one of the 12 classes used in the original training, as described previously.
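The ping-pong buffering used by these callbacks can be modeled on a host machine with a short sketch. The class below is hypothetical and is not part of the ARM sample code. It assumes a DMA buffer of interleaved two-channel samples split into two halves, and it keeps only the even-indexed samples of each half, mirroring the indexing pattern used in the callbacks of Listing 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a DMA ping-pong buffer holding
// interleaved two-channel samples (ch0, ch1, ch0, ch1, ...).
class PingPongAudio {
public:
    explicit PingPongAudio(std::size_t block_size)
        : block_size_(block_size), dma_buffer_(4 * block_size, 0) {}

    // Storage the (simulated) DMA engine writes into: two halves, each
    // holding block_size two-channel frames.
    std::vector<std::int16_t>& dma_buffer() { return dma_buffer_; }

    // First half is full; the DMA keeps filling the second half meanwhile.
    std::vector<std::int16_t> on_half_transfer() const { return extract_channel0(0); }

    // Second half is full; the DMA wraps around to the first half.
    std::vector<std::int16_t> on_transfer_complete() const {
        return extract_channel0(2 * block_size_);
    }

private:
    // Keep every other sample, starting at `offset`, to obtain one channel.
    std::vector<std::int16_t> extract_channel0(std::size_t offset) const {
        std::vector<std::int16_t> mono(block_size_);
        for (std::size_t i = 0; i < block_size_; ++i) mono[i] = dma_buffer_[offset + 2 * i];
        return mono;
    }

    std::size_t block_size_;
    std::vector<std::int16_t> dma_buffer_;
};
```

The benefit of this arrangement is that the processor always works on the half of the buffer that the DMA engine has just finished writing, so recording never has to pause while feature extraction and inference run.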

The inference engine itself is instantiated through a series of calls, starting with the call in main that instantiates the KWS_F746NG class, which itself is a subclass of the KWS_DS_CNN class. This latter class combines the C++ DS-CNN model shown earlier with a parent class, KWS, which implements the specific inference engine methods: extract_features(), classify(), and more (Listing 3).

#include <cstring>
#include "kws.h"

KWS::~KWS()
{
  delete mfcc;
  //the remaining buffers are allocated with new[] in init_kws()
  delete[] mfcc_buffer;
  delete[] output;
  delete[] predictions;
  delete[] averaged_output;
}

void KWS::init_kws()
{
  num_mfcc_features = nn->get_num_mfcc_features();
  num_frames = nn->get_num_frames();
  frame_len = nn->get_frame_len();
  frame_shift = nn->get_frame_shift();
  int mfcc_dec_bits = nn->get_in_dec_bits();
  num_out_classes = nn->get_num_out_classes();
  mfcc = new MFCC(num_mfcc_features, frame_len, mfcc_dec_bits);
  mfcc_buffer = new q7_t[num_frames*num_mfcc_features];
  output = new q7_t[num_out_classes];
  averaged_output = new q7_t[num_out_classes];
  predictions = new q7_t[sliding_window_len*num_out_classes];
  audio_block_size = recording_win*frame_shift;
  audio_buffer_size = audio_block_size + frame_len - frame_shift;
}

void KWS::extract_features()
{
  if(num_frames>recording_win) {
    //move old features left
    memmove(mfcc_buffer,mfcc_buffer+(recording_win*num_mfcc_features),(num_frames-recording_win)*num_mfcc_features);
  }
  //compute features only for the newly recorded audio
  int32_t mfcc_buffer_head = (num_frames-recording_win)*num_mfcc_features;
  for (uint16_t f = 0; f < recording_win; f++) {
    mfcc->mfcc_compute(audio_buffer+(f*frame_shift),&mfcc_buffer[mfcc_buffer_head]);
    mfcc_buffer_head += num_mfcc_features;
  }
}

void KWS::classify()
{
  nn->run_nn(mfcc_buffer, output);
  // Softmax
  arm_softmax_q7(output,num_out_classes,output);
}

int KWS::get_top_class(q7_t* prediction)
{
  int max_ind=0;
  int max_val=-128;
  for(int i=0;i<num_out_classes;i++) {
    if(max_val<prediction[i]) {
      max_val = prediction[i];
      max_ind = i;
    }
  }
  return max_ind;
}

void KWS::average_predictions()
{
  //shift right old predictions
  arm_copy_q7((q7_t *)predictions, (q7_t *)(predictions+num_out_classes), (sliding_window_len-1)*num_out_classes);
  //add new predictions
  arm_copy_q7((q7_t *)output, (q7_t *)predictions, num_out_classes);
  //compute averages
  int sum;
  for(int j=0;j<num_out_classes;j++) {
    sum=0;   //reset the accumulator for each class
    for(int i=0;i<sliding_window_len;i++)
      sum += predictions[i*num_out_classes+j];
    averaged_output[j] = (q7_t)(sum/sliding_window_len);
  }
}

Listing 3: In the ARM DS-CNN KWS application, a KWS module adds methods on the base DS-CNN class needed to perform inference operations including feature extraction, classification, and generation of results smoothed by an averaging window. (Code source: ARM)

All of this software complexity is hidden behind a simple use model where the main routine starts the process by instantiating the inference engine and using its completion routine to perform inference as audio input becomes available. According to ARM, this sample CMSIS-NN implementation running on the STM32F746G-DISCO development board needs only about 12 milliseconds (ms) to complete an inference cycle, which includes audio data buffer copying, feature extraction, and DS-CNN model execution. Just as important, the complete KWS application requires only about 70 Kbytes of memory.


Conclusion

As KWS capability becomes an increasingly important requirement, developers of resource-limited wearables and other IoT designs need small footprint inference engines. Built to leverage DSP features in ARM Cortex-M7 MCUs, the ARM CMSIS-NN library provides the foundation for implementing optimized neural network architectures, such as DS-CNNs, able to meet these requirements.

Running on an ARM Cortex-M7 MCU-based development system, a KWS inference engine can achieve performance approaching 10 inferences/s in a memory footprint easily supported by resource limited IoT devices.




About this author

Stephen Evanczuk

Stephen Evanczuk has more than 20 years of experience writing for and about the electronics industry on a wide range of topics including hardware, software, systems, and applications including the IoT. He received his Ph.D. in neuroscience on neuronal networks and worked in the aerospace industry on massively distributed secure systems and algorithm acceleration methods. Currently, when he's not writing articles on technology and engineering, he's working on applications of deep learning to recognition and recommendation systems.
