PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno

TL;DR

PicoSAM2 targets real-time, on-device segmentation under tight edge hardware constraints by reengineering SAM2-inspired promptable segmentation into a compact, in-sensor-friendly CNN design. It employs a depthwise separable U-Net with fixed-point prompt encoding realized through centered training crops, and uses knowledge distillation from SAM2 with a dynamic loss to balance soft teacher guidance and hard ground-truth supervision. Trained on COCO and evaluated on LVIS, PicoSAM2 achieves 51.9% mIoU on COCO and 44.9% on LVIS with 1.3M parameters (1.22 MB quantized) and 336M MACs, and runs at 14.3 ms on the Sony IMX500, meeting both memory (<8 MB) and operator constraints. Distillation significantly boosts LVIS performance (+3.5% mIoU and +5.1% mAP), demonstrating that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving edge vision without cloud or host processing.
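The balance between soft teacher guidance and hard ground-truth supervision can be illustrated with a small sketch. The summary does not give the exact form of the dynamic loss, so everything below is an assumption made for illustration: the helper names (bce, teacher_weight, blended_loss), the use of binary cross-entropy for both terms, and the rule that weights the teacher by its IoU with the ground truth. It is not the authors' formulation.

```python
import math

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one pixel; target may be soft (teacher) or hard (label)."""
    p = min(max(prob, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def teacher_weight(teacher, truth, thresh=0.5):
    """Hypothetical dynamic weight: IoU between the thresholded teacher mask and the label.

    A teacher mask that agrees with the ground truth gets more weight;
    a poor teacher mask gets less. This rule is an assumption, not from the paper.
    """
    t_bin = [1 if t >= thresh else 0 for t in teacher]
    inter = sum(a & b for a, b in zip(t_bin, truth))
    union = sum(a | b for a, b in zip(t_bin, truth))
    return inter / union if union else 0.0

def blended_loss(student, teacher, truth, alpha):
    """alpha scales the soft (teacher) term, 1 - alpha the hard (ground-truth) term."""
    n = len(student)
    soft = sum(bce(s, t) for s, t in zip(student, teacher)) / n
    hard = sum(bce(s, g) for s, g in zip(student, truth)) / n
    return alpha * soft + (1 - alpha) * hard

# Toy 4-pixel masks given as flat lists of foreground probabilities.
student = [0.9, 0.1, 0.8, 0.2]
teacher = [0.95, 0.05, 0.7, 0.6]
truth = [1, 0, 1, 0]

alpha = teacher_weight(teacher, truth)
loss = blended_loss(student, teacher, truth, alpha)
```

With alpha = 0 the loss reduces to plain supervised training on the labels, and with alpha = 1 it reduces to pure imitation of the teacher, so any per-sample weighting rule moves the student between those two extremes.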

Abstract

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22 MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
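Two of the figures above lend themselves to a quick back-of-the-envelope check. The sketch below first compares the weight count of a standard convolution with a depthwise separable one, which is why this building block keeps the model small, and then converts the reported 336M MACs and 86 MACs/cycle into a cycle count. The layer sizes (64 to 128 channels, 3x3 kernel) are illustrative assumptions rather than PicoSAM2's actual layer shapes, biases are ignored, and the cycle count assumes that MACs/cycle means total MACs divided by total cycles.

```python
def conv_weights(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_weights(c_in, c_out, k=3):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Illustrative layer only; not a layer taken from the paper.
standard = conv_weights(64, 128)              # 73728
separable = separable_conv_weights(64, 128)   # 8768
reduction = standard / separable              # roughly 8.4x fewer weights

# Figures reported in the abstract.
TOTAL_MACS = 336e6
MACS_PER_CYCLE = 86

# Assumption: MACs/cycle is total MACs divided by total cycles.
cycles = TOTAL_MACS / MACS_PER_CYCLE          # roughly 3.9 million cycles
```

The same two functions can be applied layer by layer to estimate how much of a U-Net's size comes from its widest stages, which is where the separable form saves the most.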

Paper Structure

This paper contains 8 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison of segmentation models: Latency vs. Memory
  • Figure 2: Qualitative comparison of each model's mask inference.
  • Figure 3: Schematic of the PicoSAM2 architecture.
  • Figure 4: Segmentation accuracy (mIoU) and precision (mAP) vs. model size on LVIS (log scale).