Visual Wake Words Dataset
Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, Rocky Rhodes
TL;DR
The paper tackles the challenge of deploying computer vision on memory-constrained microcontrollers by introducing the Visual Wake Words dataset, a COCO-derived binary task (person vs. not-person) designed to benchmark tiny vision models within a 250 KB memory footprint and under 60M MAdds. It evaluates state-of-the-art mobile architectures (MobileNet V1/V2, MNasNet, ShuffleNet) with 8-bit quantization and finds that these models reach 85–90% accuracy on Visual Wake Words while staying within memory and latency budgets. The work analyzes memory-latency trade-offs, showing that peak SRAM usage is dominated by the early layers and discussing practical strategies for managing buffers, with measured on-device latency of around 1.3 seconds per inference on an STM32 platform. Overall, Visual Wake Words provides a practical platform to push the Pareto-optimal boundary of accuracy vs. memory for microcontroller vision, guiding future tiny-model design and optimization.
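To make the point about early layers dominating peak SRAM usage concrete, here is a minimal sketch in Python. It assumes a plain chain of layers with 8-bit activations (one byte per value) in which a layer's input and output buffers must both be resident while that layer runs. The layer shapes are hypothetical and chosen only for illustration; they are not taken from the paper, and the estimate ignores weights, skip connections, and runtime overhead.

```python
def activation_bytes(shape, bytes_per_value=1):
    """Size of one activation tensor of shape (height, width, channels).

    bytes_per_value=1 corresponds to 8-bit quantized activations.
    """
    height, width, channels = shape
    return height * width * channels * bytes_per_value


def per_layer_working_set(shapes):
    """Input-plus-output buffer size for each layer in a simple chain.

    Assumption: layer i reads shapes[i] and writes shapes[i + 1], and both
    buffers are live at the same time.
    """
    return [
        activation_bytes(src) + activation_bytes(dst)
        for src, dst in zip(shapes, shapes[1:])
    ]


# Hypothetical activation shapes: a 96x96 RGB input followed by four
# stride-2 layers that halve the resolution and grow the channel count.
shapes = [(96, 96, 3), (48, 48, 8), (24, 24, 16), (12, 12, 32), (6, 6, 64)]

working_sets = per_layer_working_set(shapes)
peak = max(working_sets)
peak_layer = working_sets.index(peak)

# 250 KB is read here as 250,000 bytes; only activations are counted.
BUDGET_BYTES = 250_000

print("working set per layer (bytes):", working_sets)
print("peak:", peak, "bytes at layer", peak_layer)
print("fits activation budget:", peak <= BUDGET_BYTES)
```

In this toy configuration the first layer has the largest working set, because spatial resolution shrinks faster than the channel count grows. That matches the qualitative pattern the summary describes, though real models need a per-architecture analysis.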
Abstract
The emergence of Internet of Things (IoT) applications requires intelligence on the edge. Microcontrollers provide a low-cost compute platform to deploy intelligent IoT applications using machine learning at scale, but have extremely limited on-chip memory and compute capability. To deploy computer vision on such devices, we need tiny vision models that fit within a few hundred kilobytes of memory footprint in terms of peak usage and model size on device storage. To facilitate the development of microcontroller-friendly models, we present a new dataset, Visual Wake Words, that represents a common microcontroller vision use case of identifying whether a person is present in the image or not, and provides a realistic benchmark for tiny vision models. Within a limited memory footprint of 250 KB, several state-of-the-art mobile models achieve accuracy of 85-90% on the Visual Wake Words dataset. We anticipate the proposed dataset will advance the research on tiny vision models that can push the Pareto-optimal boundary in terms of accuracy versus memory usage for microcontroller applications.
