SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

Chengxi Zeng; Yuxuan Jiang; Ge Gao; Shuai Wang; Duolikun Danier; Bin Zhu; Stevan Rudinac; David Bull; Fan Zhang

TL;DR

This work identifies substantial inefficiencies in SAM3's text encoder for vision-language segmentation by analyzing 404,796 prompts across multiple datasets. It shows prompts are short, vocabulary usage is sparse, and the output embeddings lie on a low-dimensional manifold with intrinsic dimensionality around $16$--$19$, revealing strong over-provisioning. Leveraging these insights, the authors implement SAM3-LiteText via domain-aware knowledge distillation from the SAM3 teacher to MobileCLIP variants with a reduced context length $L=16$, achieving up to $88\%$ parameter reduction while preserving about $98.1\%$ of the teacher's performance. The approach enables effective on-device, edge deployment for segmentation tasks with reduced static memory and modest latency gains, offering a practical path toward democratizing access to vision-language capabilities on resource-constrained hardware. Overall, the paper demonstrates that task-specific prompt structures enable dramatic, principled compression of text encoders without sacrificing downstream grounding accuracy.

Abstract

Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low-dimensional manifold despite their high-dimensional representation. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.
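
To make the "underutilized context window" observation concrete, the sketch below measures how much of a fixed context window a set of prompts occupies. It is an illustration under stated assumptions, not the authors' analysis code: the whitespace split plus two special tokens only approximates a BPE token count, and `WindowStats`, `count_tokens`, `context_window_stats`, and the sample prompts are hypothetical names and data introduced here.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    mean_tokens: float      # average tokens per prompt, before truncation
    utilization: float      # share of window slots holding real tokens
    truncated_share: float  # share of prompts longer than the window


def count_tokens(prompt: str) -> int:
    # Assumption: whitespace split + 2 special tokens stands in for a BPE length.
    return len(prompt.split()) + 2


def context_window_stats(prompts: list[str], context_length: int) -> WindowStats:
    if not prompts or context_length <= 0:
        raise ValueError("need at least one prompt and a positive context length")
    lengths = [count_tokens(p) for p in prompts]
    # Slots beyond the prompt length are padding; tokens beyond the window are cut.
    used = sum(min(n, context_length) for n in lengths)
    return WindowStats(
        mean_tokens=sum(lengths) / len(lengths),
        utilization=used / (len(lengths) * context_length),
        truncated_share=sum(n > context_length for n in lengths) / len(lengths),
    )


# Hypothetical prompts in the short, noun-phrase style typical of segmentation.
prompts = ["red car", "person holding an umbrella", "dog"]
stats = context_window_stats(prompts, context_length=16)
print(stats)
```

Running the same function at several candidate context lengths shows the trade-off a shortened window makes: utilization rises as the window shrinks, while `truncated_share` reports how many prompts would lose tokens.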

Paper Structure (38 sections, 6 equations, 4 figures, 8 tables)

Introduction
Related Work
Vision-Language Segmentation.
Multi-Object Tracking (MOT) and Segmentation.
Efficient Vision Foundation Models.
Efficient Text Encoders.
Anatomical Analysis: Quantifying Text Encoder Redundancy
Prompt Statistics and Context Window
Datasets and Preprocessing
Token Length Distribution
Context Window Efficiency
Vocabulary Coverage
Embedding Space Analysis
Token Embedding SVD Analysis
Positional Embedding Similarity Analysis
...and 23 more sections

Figures (4)

Figure 1: Token length distribution across 404,796 unique prompts from six sources: RF100-VL, SA-Co-Gold, SA-Co-Silver, SA-Co-VEval, LVIS, and RefCOCO. Each dataset component is shown separately, revealing distinct prompt length characteristics. The combined mean is $\mu = 7.9$ tokens.
Figure 2: Vocabulary coverage analysis. (a) Only 35% of the 49,408 BPE tokens are ever used in segmentation prompts. (b) Token frequency is highly skewed: the top 100 tokens cover 58.5% of all occurrences.
Figure 3: Positional embedding similarity analysis. The cosine similarity heatmap shows high correlation among late positions (8+), which are 1.5$\times$ more similar within-group than positions 0--7.
Figure 4: Output embedding intrinsic dimensionality. Utilization estimates indicate that only $\sim$6--8% of the 256-dimensional space is actually used.
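
The two statistics in Figure 2 and the utilization figure in Figure 4 are simple ratios, and a minimal sketch of how they could be computed follows. The function names `vocabulary_coverage` and `top_k_share` and the toy token stream are hypothetical; a real analysis would run over BPE token IDs from the full prompt corpus. The final lines only check the arithmetic: an intrinsic dimensionality of 16 to 19 in a 256-dimensional space gives 6.25% and about 7.4%, which agrees with the reported range of roughly 6--8%.

```python
from collections import Counter


def vocabulary_coverage(token_ids, vocab_size):
    """Fraction of the vocabulary that appears at least once."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return len(set(token_ids)) / vocab_size


def top_k_share(token_ids, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    counts = Counter(token_ids)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(token_ids)


# Toy token stream over a hypothetical 10-token vocabulary (not real data).
tokens = [0, 0, 0, 0, 1, 1, 2, 3]
print(vocabulary_coverage(tokens, vocab_size=10))  # 4 distinct tokens out of 10
print(top_k_share(tokens, k=2))                    # tokens 0 and 1 cover 6 of 8

# Arithmetic check: intrinsic dimensionality divided by embedding width.
low, high = 16 / 256, 19 / 256
print(low, high)
```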
