TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings

Bibin Wilson

TL;DR

TinyVLM is the first framework to enable zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. It introduces a decoupled architecture that separates visual inference from text encoding, so that precomputed class embeddings can be stored in flash memory.

Abstract

Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
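The decoupled inference path described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the toy 8-dimensional embeddings, and the symmetric per-vector int8 scheme are all assumptions made for illustration. The sketch shows the offline step (truncate a text embedding to a Matryoshka prefix, renormalize, quantize) and the on-device step (compare an image embedding against the stored prototypes).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization (an assumed scheme).

    Returns integer codes in [-127, 127] and the scale that maps them back.
    """
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale

def build_prototypes(text_embeddings, dim):
    """Offline step: keep the first `dim` coordinates, renormalize, quantize."""
    return {
        name: quantize_int8(l2_normalize(emb[:dim]))
        for name, emb in text_embeddings.items()
    }

def classify(image_embedding, prototypes, dim):
    """On-device step: pick the prototype with the highest similarity."""
    query = l2_normalize(image_embedding[:dim])
    best_name, best_score = None, -math.inf
    for name, (codes, scale) in prototypes.items():
        # Dequantize implicitly by multiplying the integer dot product by the scale.
        score = scale * sum(q * c for q, c in zip(query, codes))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy embeddings, made up for illustration only.
TEXT_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1],
    "dog": [0.1, 0.9, 0.2, 0.0, 0.0, 0.1, 0.1, 0.3],
    "car": [0.0, 0.2, 0.9, 0.1, 0.3, 0.0, 0.0, 0.1],
}
IMAGE_EMBEDDING = [0.8, 0.2, 0.1, 0.1, 0.1, 0.0, 0.2, 0.1]

if __name__ == "__main__":
    for dim in (8, 4):
        table = build_prototypes(TEXT_EMBEDDINGS, dim)
        print(dim, classify(IMAGE_EMBEDDING, table, dim))
```

Because only the prototype table depends on the class names, new classes could in principle be added by writing new rows to flash without touching the vision encoder, which is the property the decoupled design is meant to provide.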

Paper Structure (65 sections, 15 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Vision-Language Models
Efficient Vision-Language Models
Knowledge Distillation for VLMs
Matryoshka Representations
TinyML and MCU Deployment
Method
Problem Formulation
Decoupled Architecture
Matryoshka Distillation
Matryoshka Embedding Structure
Training Objective
Contrastive Loss.
Embedding Distillation Loss.
...and 50 more sections

Figures (9)

Figure 1: TinyVLM enables zero-shot detection on MCUs. While existing VLMs require 18-350MB, TinyVLM achieves zero-shot capability within MCU memory constraints (<1MB flash for the deployed vision encoder) through Matryoshka distillation and decoupled architecture.
Figure 2: TinyVLM overview. (Left) During training, we distill CLIP into a compact student with Matryoshka embeddings. (Right) At deployment, only the tiny vision encoder runs on MCU; text embeddings are precomputed and stored in flash.
Figure 3: Memory layout for TinyVLM on STM32H7.
Figure 4: Relative accuracy vs. embedding dimension on COCO (compared to CLIP ViT-B/32). Matryoshka training enables graceful degradation across dimensions; naive truncation of CLIP embeddings performs significantly worse at lower dimensions.
Figure 5: Zero-shot accuracy across Matryoshka embedding dimensions. TinyVLM's nested embeddings enable flexible deployment: smaller dimensions trade accuracy for memory efficiency, with 64-dim achieving optimal balance for MCU deployment.
...and 4 more figures
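The accuracy-memory trade-off mentioned in the Figure 5 caption can be made concrete with a rough storage estimate for the class prototypes. The sketch below assumes 80 classes (the number of COCO object categories), 4-byte float32 values as the unquantized baseline, 1-byte int8 values as the quantized format, and power-of-two dimensions between the 16 and 256 endpoints given in the abstract. It ignores any per-vector scale factors or alignment padding; the helper name is hypothetical.

```python
def prototype_bytes(num_classes, dim, bytes_per_value):
    """Storage for a dense table of class prototypes, ignoring metadata."""
    return num_classes * dim * bytes_per_value

NUM_CLASSES = 80  # COCO object categories
DIMS = (16, 32, 64, 128, 256)  # intermediate sizes are an assumption

if __name__ == "__main__":
    for dim in DIMS:
        fp32 = prototype_bytes(NUM_CLASSES, dim, 4)  # assumed float32 baseline
        int8 = prototype_bytes(NUM_CLASSES, dim, 1)  # assumed int8 storage
        print(f"dim={dim:3d}  float32={fp32:6d} B  int8={int8:6d} B")
```

Under these assumptions, moving from float32 to int8 gives the 4x reduction quoted in the abstract at every dimension, and the 64-dimensional table for 80 classes occupies 5,120 bytes.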
