Visual Wake Words Dataset

Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, Rocky Rhodes

TL;DR

The paper tackles the challenge of deploying computer vision on memory-constrained microcontrollers by introducing the Visual Wake Words dataset, a COCO-derived binary task (person vs not-person) designed to benchmark tiny vision models within a 250 KB memory footprint and under 60M MAdds. It evaluates state-of-the-art mobile architectures (MobileNet V1/V2, MNasNet, ShuffleNet) with 8-bit quantization and shows that these models can reach 85–90% accuracy on Visual Wake Words while staying within memory and latency budgets. The work analyzes memory-latency trade-offs, showing that peak SRAM usage is dominated by early layers and describing practical strategies to manage buffers, with measured on-device latency of around 1.3 seconds per inference on an STM32 platform. Overall, Visual Wake Words provides a practical benchmark for pushing the Pareto-optimal boundary of accuracy vs memory for microcontroller vision, guiding future tiny-model design and optimization.

Abstract

The emergence of Internet of Things (IoT) applications requires intelligence on the edge. Microcontrollers provide a low-cost compute platform to deploy intelligent IoT applications using machine learning at scale, but have extremely limited on-chip memory and compute capability. To deploy computer vision on such devices, we need tiny vision models that fit within a few hundred kilobytes of memory footprint in terms of peak usage and model size on device storage. To facilitate the development of microcontroller-friendly models, we present a new dataset, Visual Wake Words, that represents a common microcontroller vision use-case of identifying whether a person is present in the image or not, and provides a realistic benchmark for tiny vision models. Within a limited memory footprint of 250 KB, several state-of-the-art mobile models achieve accuracy of 85-90% on the Visual Wake Words dataset. We anticipate the proposed dataset will advance the research on tiny vision models that can push the Pareto-optimal boundary in terms of accuracy versus memory usage for microcontroller applications.

Paper Structure

This paper contains 24 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Sample images labeled as 'person' and 'not-person' from the COCO training dataset.
  • Figure 2: On the ImageNet dataset, we compare the top-1 accuracy versus the memory footprint, model size and multiply-adds per inference. Panel (a) shows the top-1 accuracy vs estimated peak memory usage (in KB), (b) the top-1 accuracy vs number of parameters (in KB), and (c) the top-1 accuracy vs multiply-adds (in millions). Each point corresponds to a different image resolution in {96, 128, 160, 192, 224}.
  • Figure 3: On the Visual Wake Words dataset, we compare the accuracy versus the memory footprint, model size and multiply-adds per inference. Panel (a) shows the accuracy vs estimated peak memory usage (in KB), (b) the accuracy vs number of parameters (in KB), and (c) the accuracy vs multiply-adds (in millions). Each point corresponds to a different image resolution in {96, 128, 160, 192, 224}. Note that the red and black points are overlapping.
  • Figure 4: Temporary buffer management for MobileNet V1 (left) and MobileNet V2 (right).
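
The figures above plot accuracy against an estimated peak memory usage at several input resolutions. As a rough illustration of why early layers dominate that peak, the sketch below estimates activation memory for the first few layers of a MobileNet V1-style network. It is not the paper's estimator: it assumes 8-bit activations (1 byte per element), that only one layer's input and output buffers are resident at a time, and a width multiplier of 1.0. The helper names `mobilenet_v1_early_shapes` and `peak_activation_bytes` are hypothetical.

```python
# Rough sketch (assumptions, not the paper's method):
#  - activations are 8-bit, i.e. 1 byte per element
#  - at any moment only the current layer's input and output are held in SRAM
#  - layer shapes follow the standard MobileNet V1 stem at the given width


def mobilenet_v1_early_shapes(resolution, width=1.0):
    """Return (H, W, C) activation shapes for the input and first three layers."""
    half = (resolution + 1) // 2  # stride-2 conv with 'same' padding
    c32 = int(32 * width)
    c64 = int(64 * width)
    return [
        (resolution, resolution, 3),  # input image
        (half, half, c32),            # 3x3 conv, stride 2
        (half, half, c32),            # 3x3 depthwise conv, stride 1
        (half, half, c64),            # 1x1 pointwise conv
    ]


def peak_activation_bytes(shapes, bytes_per_element=1):
    """Largest input+output buffer pair across consecutive layers."""
    sizes = [h * w * c * bytes_per_element for h, w, c in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


# Same resolutions as the points in Figures 2 and 3.
for res in (96, 128, 160, 192, 224):
    kb = peak_activation_bytes(mobilenet_v1_early_shapes(res)) / 1024
    print(f"{res}x{res}: ~{kb:.0f} KB")
```

Under these assumptions the estimate is about 216 KB at 96x96 and about 1176 KB at 224x224, so input resolution (along with channel width) largely determines whether the early-layer buffers fit in a 250 KB budget. Real deployments also need room for weights, scratch buffers, and runtime overhead, which this sketch ignores.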