
TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings

Bibin Wilson

TL;DR

TinyVLM is the first framework to enable zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. It introduces a decoupled architecture that separates visual inference from text encoding, so that precomputed class embeddings can be stored in flash memory.

Abstract

Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
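The deployment side of the decoupled design can be illustrated with a small sketch. It is not the authors' implementation: the symmetric per-vector int8 scheme, the helper names (`quantize_int8`, `build_prototype_table`, `classify`), and the random stand-in embeddings are all assumptions made for illustration. The sketch shows the offline step (truncate a text embedding to a Matryoshka prefix, renormalize, quantize) and the on-device step (score an image embedding against the stored prototypes).

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def quantize_int8(vec):
    """Assumed scheme: symmetric per-vector int8 quantization -> (codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def build_prototype_table(text_embeddings, dim):
    """Offline step: keep the first `dim` values, renormalize, quantize.

    On a real device this table would live in flash; here it is a dict.
    """
    return {
        name: quantize_int8(l2_normalize(emb[:dim]))
        for name, emb in text_embeddings.items()
    }


def classify(image_embedding, table, dim):
    """On-device step: approximate cosine similarity against each prototype."""
    query = l2_normalize(image_embedding[:dim])
    best_name, best_score = None, -math.inf
    for name, (codes, scale) in table.items():
        # Dequantize implicitly: dot(query, codes) * scale.
        score = scale * sum(q * c for q, c in zip(query, codes))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Stand-in data: random vectors replace real text/vision encoder outputs.
rng = random.Random(0)
FULL_DIM = 256
text_embeddings = {
    name: [rng.gauss(0, 1) for _ in range(FULL_DIM)]
    for name in ("cat", "dog", "bicycle")
}
# Fake "image" embedding: the "dog" vector plus a little noise.
image_embedding = [x + 0.1 * rng.gauss(0, 1) for x in text_embeddings["dog"]]

table = build_prototype_table(text_embeddings, dim=64)
label, score = classify(image_embedding, table, dim=64)
print(label, round(score, 3))
```

No text encoder runs in `classify`: adding a class only means adding one quantized row to the table. Truncating to a prefix is only expected to work well when the embeddings were trained with nested dimensions, which the random vectors here do not model.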

Paper Structure (65 sections, 15 equations, 9 figures, 9 tables, 1 algorithm)


Figures (9)

  • Figure 1: TinyVLM enables zero-shot detection on MCUs. While existing VLMs require 18-350MB, TinyVLM achieves zero-shot capability within MCU memory constraints (<1MB flash for the deployed vision encoder) through Matryoshka distillation and a decoupled architecture.
  • Figure 2: TinyVLM overview. (Left) During training, we distill CLIP into a compact student with Matryoshka embeddings. (Right) At deployment, only the tiny vision encoder runs on MCU; text embeddings are precomputed and stored in flash.
  • Figure 3: Memory layout for TinyVLM on STM32H7.
  • Figure 4: Relative accuracy vs. embedding dimension on COCO (compared to CLIP ViT-B/32). Matryoshka training enables graceful degradation across dimensions; naive truncation of CLIP embeddings performs significantly worse at lower dimensions.
  • Figure 5: Zero-shot accuracy across Matryoshka embedding dimensions. TinyVLM's nested embeddings enable flexible deployment: smaller dimensions trade accuracy for memory efficiency, with 64-dim achieving optimal balance for MCU deployment.
  • ...and 4 more figures
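The accuracy-memory trade-off across embedding dimensions can be made concrete with a little arithmetic on prototype storage. The sketch below is an illustration under stated assumptions: the intermediate dimensions (32, 64, 128) between the stated 16 and 256 are assumed, the 4x reduction is modeled as float32 versus int8 storage, and any per-vector scale factors or alignment padding are ignored. The class counts (80 for COCO, 102 for Flowers102, 101 for Food101) are the standard ones for those datasets.

```python
# Standard class counts for the three evaluation datasets.
DATASET_CLASSES = {"COCO": 80, "Flowers102": 102, "Food101": 101}

# Assumed nested dimensions; the source only states the range 16-256.
DIMS = (16, 32, 64, 128, 256)


def prototype_bytes(num_classes, dim, bytes_per_value):
    """Bytes needed to store one embedding of size `dim` per class."""
    return num_classes * dim * bytes_per_value


for dataset, num_classes in DATASET_CLASSES.items():
    for dim in DIMS:
        fp32 = prototype_bytes(num_classes, dim, 4)  # float32 storage
        int8 = prototype_bytes(num_classes, dim, 1)  # assumed int8 storage
        print(f"{dataset:<11} dim={dim:<3} float32={fp32:>7} B  int8={int8:>6} B")
```

Under these assumptions, 80 COCO prototypes at 64 dimensions take 20,480 bytes in float32 and 5,120 bytes in int8, a small fraction of the 892KB flash footprint reported for the vision encoder.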