SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, Fan Zhang

TL;DR

This work identifies substantial inefficiencies in SAM3's text encoder for vision-language segmentation by analyzing 404,796 prompts across multiple datasets. It shows that prompts are short, vocabulary usage is sparse, and the output embeddings lie on a low-dimensional manifold with an intrinsic dimensionality of around $16$--$19$, which indicates strong over-provisioning. Building on these findings, the authors implement SAM3-LiteText via domain-aware knowledge distillation from the SAM3 teacher to MobileCLIP variants with a reduced context length of $L=16$, achieving up to $88\%$ parameter reduction while preserving about $98.1\%$ of the teacher's performance. The approach enables on-device edge deployment for segmentation tasks with reduced static memory and modest latency gains, offering a practical path toward vision-language capabilities on resource-constrained hardware. Overall, the paper demonstrates that task-specific prompt structure allows substantial, principled compression of text encoders with little loss in downstream grounding accuracy.

Abstract

Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low-dimensional manifold despite their high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing the static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.
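
The distillation step described above can be illustrated with a generic embedding-matching objective. This is a minimal sketch, not the paper's objective: the summary does not state the exact loss or what makes the distillation "domain-aware", so the function names (`cosine_similarity`, `distillation_loss`), the mix of mean-squared error and cosine distance, and the weight `alpha` are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + eps)


def distillation_loss(student, teacher, alpha=0.5):
    """Hypothetical embedding-matching loss averaged over a batch.

    student, teacher: lists of embedding vectors with identical shapes.
    alpha: assumed weight between the MSE term and the cosine-distance term.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher batches must have the same size")
    total = 0.0
    for s, t in zip(student, teacher):
        if len(s) != len(t):
            raise ValueError("embedding dimensions must match")
        mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(t)
        cos_dist = 1.0 - cosine_similarity(s, t)
        total += alpha * mse + (1.0 - alpha) * cos_dist
    return total / len(teacher)


# Toy 2-dimensional embeddings; real text embeddings would be much wider.
teacher_batch = [[1.0, 0.0], [0.0, 1.0]]
student_batch = [[1.0, 0.0], [1.0, 0.0]]
loss = distillation_loss(student_batch, teacher_batch)
```

In this toy batch the first student embedding matches its teacher exactly and contributes almost nothing, while the second is orthogonal to its teacher and is penalized by both terms. In a real setup the teacher embeddings would come from the frozen SAM3 text encoder and the student embeddings from the MobileCLIP variant.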

Paper Structure (38 sections, 6 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Token length distribution across 404,796 unique prompts from six sources: RF100-VL, SA-Co-Gold, SA-Co-Silver, SA-Co-VEval, LVIS, and RefCOCO. Each dataset component is shown separately, revealing distinct prompt length characteristics. The combined mean is $\mu = 7.9$ tokens.
  • Figure 2: Vocabulary coverage analysis. (a) Only 35% of the 49,408 BPE tokens are ever used in segmentation prompts. (b) Token frequency is highly skewed: the top 100 tokens cover 58.5% of all occurrences.
  • Figure 3: Positional embedding similarity analysis. The cosine similarity heatmap shows high correlation among late positions (8+), which are 1.5$\times$ more similar within-group than positions 0--7.
  • Figure 4: Output embedding intrinsic dimensionality. Utilization estimates indicate that only $\sim$6--8% of the 256-dimensional space is actually used.
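
The statistics named in the figure captions can be computed from tokenized prompts along the following lines. This is a rough sketch under stated assumptions: the helper names and the toy token IDs are hypothetical, and the paper's analysis code may differ. In particular, treating "utilization" as intrinsic dimensionality divided by embedding width is an assumption, although it is consistent with the reported figures (an intrinsic dimensionality of 16 to 19 in a 256-dimensional space).

```python
from collections import Counter


def mean_length(prompts):
    """Average number of tokens per prompt (cf. Figure 1)."""
    return sum(len(p) for p in prompts) / len(prompts)


def fraction_within_context(prompts, context_length):
    """Share of prompts that fit in a given context length without truncation."""
    return sum(len(p) <= context_length for p in prompts) / len(prompts)


def vocabulary_coverage(prompts, vocab_size):
    """Fraction of the vocabulary that appears at least once (cf. Figure 2a)."""
    used = {tok for p in prompts for tok in p}
    return len(used) / vocab_size


def top_k_share(prompts, k):
    """Fraction of all token occurrences covered by the k most frequent tokens (cf. Figure 2b)."""
    counts = Counter(tok for p in prompts for tok in p)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def utilization(intrinsic_dim, embed_dim):
    """Assumed definition: intrinsic dimensionality over embedding width (cf. Figure 4)."""
    return intrinsic_dim / embed_dim


# Hypothetical token-ID sequences standing in for tokenized prompts.
toy_prompts = [[5, 9, 2], [5, 9], [5, 7, 7, 1]]

print(mean_length(toy_prompts))                  # 3.0
print(vocabulary_coverage(toy_prompts, 10))      # 0.5
print(utilization(16, 256), utilization(19, 256))  # 0.0625 0.07421875
```

Under the assumed definition, the two utilization values (about 6.3% and 7.4%) fall inside the range of roughly 6 to 8% quoted for Figure 4. With real data, `fraction_within_context(prompts, 16)` would show how many prompts are left untruncated by the reduced context length $L=16$.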