SlimSAM: 0.1% Data Makes Segment Anything Slim

Zigeng Chen; Gongfan Fang; Xinyin Ma; Xinchao Wang

TL;DR

SlimSAM tackles the challenge of compressing the Segment Anything Model (SAM) with extremely limited training data. It introduces two techniques. The first is alternate slimming, which prunes and distills decoupled sub-structures (embedding and bottleneck) in alternating steps. The second is disturbed Taylor pruning, which aligns the pruning criterion with the distillation target in a label-free setting. By decomposing the model and applying dynamic distillation losses across the embedding and bottleneck stages, SlimSAM reduces parameters and MACs to 1.4% and 0.8% of the original while using only 0.1% of the training data, with inference speedups of up to 8.6× on a Titan RTX. The approach outperforms existing SAM compression methods under the same or lower data budgets, showing that the knowledge of a pre-trained SAM can be effectively inherited with minimal data and compute.
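The alternate slimming schedule described above can be sketched as a plan over two decoupled widths. This is a minimal sketch, not the authors' implementation. The names `EncoderWidths`, `keep`, `dynamic_weight`, `dynamic_loss`, and `alternate_slimming` are hypothetical, and the ViT-B-style widths (768 and 3072) are assumed for illustration. Both the stage order and the linear decay of the dynamic loss weight are also assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EncoderWidths:
    """Hypothetical summary of the two prunable widths of a ViT image encoder."""
    embedding: int   # residual-stream (embedding) dimension shared by all blocks
    bottleneck: int  # intermediate dimension inside the attention/MLP blocks


def keep(width: int, ratio: float) -> int:
    """Number of channels left after pruning a fraction `ratio` of `width`."""
    return max(1, round(width * (1.0 - ratio)))


def dynamic_weight(step: int, total: int) -> float:
    """Assumed linear schedule: the weight on the intermediate (bottleneck)
    loss decays from 1 to 0 over `total` distillation steps."""
    return 1.0 - step / max(1, total - 1)


def dynamic_loss(l_bottleneck: float, l_embedding: float, step: int, total: int) -> float:
    """Blend the intermediate and final-embedding distillation losses."""
    a = dynamic_weight(step, total)
    return a * l_bottleneck + (1.0 - a) * l_embedding


def alternate_slimming(widths: EncoderWidths, ratio: float) -> list:
    """Return the ordered plan: each decoupled sub-structure is pruned and
    then distilled before the other one is touched."""
    plan = []
    # Stage 1: prune the embedding width, then recover with an embedding loss.
    widths = replace(widths, embedding=keep(widths.embedding, ratio))
    plan.append(("prune", "embedding", widths))
    plan.append(("distill", "embedding loss", widths))
    # Stage 2: prune the bottleneck width, then recover with the dynamic loss.
    widths = replace(widths, bottleneck=keep(widths.bottleneck, ratio))
    plan.append(("prune", "bottleneck", widths))
    plan.append(("distill", "dynamic loss", widths))
    return plan


if __name__ == "__main__":
    # Assumed ViT-B-style widths and the 50% ratio shown in Figure 2.
    for action, target, w in alternate_slimming(EncoderWidths(768, 3072), 0.5):
        print(action, target, w)
```

Separating the stages means each distillation step has to recover from only one kind of structural change. That is the intuition behind progressively pruning decoupled sub-structures rather than compressing everything in one shot.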

Abstract

Current approaches for compressing the Segment Anything Model (SAM) yield commendable results but require extensive data to train a new network from scratch. Conventional pruning techniques can greatly reduce data requirements, but they suffer from degraded performance. To address this trade-off, we introduce SlimSAM, a novel data-efficient SAM compression method that achieves superior performance with far less training data. The core of SlimSAM is the alternate slimming framework, which effectively enhances knowledge inheritance under severely limited training data and an exceptionally high pruning ratio. Unlike prior techniques, our framework progressively compresses the model by alternately pruning and distilling distinct, decoupled sub-structures. We also propose disturbed Taylor pruning to address the misalignment between the pruning objective and the training target, thereby improving distillation after pruning. SlimSAM yields significant performance improvements while requiring over 10 times less training data than any existing compression method. Even compared with the original SAM, SlimSAM achieves close performance while reducing the parameter count to merely 1.4% (9.1M) and MACs to 0.8% (23G), and it requires only 0.1% (10k) of the SAM training data. The code is available at http://github.com/czg1225/SlimSAM.
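A toy sketch shows why a label-free Taylor criterion needs a disturbance. Suppose importance is computed from a distillation loss against the unpruned teacher. Before pruning, the student reproduces the teacher exactly, so every gradient is zero, and so is every first-order Taylor score |w · ∂L/∂w|. Here we assume the disturbance is additive Gaussian noise on the teacher output, and we use a single linear layer with an MSE loss. The function names and `sigma` are hypothetical, and the paper's exact formulation may differ.

```python
import random


def linear(W, x):
    """y = W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def taylor_importance(W, x, target):
    """First-order Taylor importance of each output channel (row of W):
    sum_j |W_ij * dL/dW_ij| with L = mean((W x - target)^2)."""
    y = linear(W, x)
    m = len(y)
    scores = []
    for i, row in enumerate(W):
        dl_dy = 2.0 * (y[i] - target[i]) / m  # dL/dy_i for the MSE loss
        # dL/dW_ij = dL/dy_i * x_j
        scores.append(sum(abs(w * dl_dy * xj) for w, xj in zip(row, x)))
    return scores


def disturbed_taylor_importance(W, x, sigma=0.01, seed=0):
    """Assumed label-free variant: the unpruned model matches the teacher
    exactly, so perturb the teacher output with Gaussian noise to get
    non-zero gradients, then score channels as above."""
    rng = random.Random(seed)
    teacher = linear(W, x)
    disturbed = [t + rng.gauss(0.0, sigma) for t in teacher]
    return taylor_importance(W, x, disturbed)


if __name__ == "__main__":
    W = [[1.0, 2.0], [0.1, 0.0], [-3.0, 0.5]]
    x = [1.0, -1.0]
    print(taylor_importance(W, x, linear(W, x)))  # undisturbed: all zeros
    print(disturbed_taylor_importance(W, x))
```

In practice such scores would be averaged over a batch and summed over all coupled parameters of a channel group before ranking.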

Paper Structure

This paper contains 18 sections, 8 equations, 11 figures, and 13 tables.

Introduction
Related Works
Methods
Identifying SAM Redundancy
Alternate Slimming
Experiments
Experimental Settings
Comparison and Analysis
Ablation Study and Analysis
Conclusion
Appendix
Ablation Study on SlimSAM-50
More Analysis on Efficiency
More Analysis on Training Costs
More Analysis on Dynamic Loss
...and 3 more sections

Figures (11)

Figure 1: A simple overall diagram of the proposed alternate slimming process.
Figure 2: Our alternate slimming process with a 50% pruning ratio on SAM-B. We use structural pruning at the channel-wise group level to compress SAM's image encoder, combined with knowledge distillation from intermediate layers to restore the pruned encoder. Red numbers highlight the pruned dimensions at each pruning step.
Figure 3: Training results on SA-1B with the common one-step method and our alternate slimming framework. Left: results with disturbed Taylor importance; right: results with random importance.
Figure 4: The intermediate dimensions of QKV Attention (top row) and MLP (bottom row) within each ViT after pruning. We present the outcomes of local pruning and global pruning under five distinct normalization methods.
Figure 5: The intermediate dimensions of QKV Attention (top row) and MLP (bottom row) within each ViT after pruning. We present the outcomes of local pruning and global pruning under five distinct normalization methods.
...and 6 more figures
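Figures 2, 4, and 5 refer to channel-wise group pruning and to local versus global pruning under several normalization methods. The sketch below contrasts the two ranking strategies on per-group importance scores and is illustrative only. The three normalizations (`none`, `mean`, `max`) are hypothetical and are not claimed to be the five methods compared in the paper. The function names are also ours.

```python
def normalize(scores, method):
    """Rescale one layer's group scores (hypothetical normalization options)."""
    if method == "none":
        return list(scores)
    if method == "mean":
        m = sum(scores) / len(scores)
        return [s / m for s in scores] if m else list(scores)
    if method == "max":
        m = max(scores)
        return [s / m for s in scores] if m else list(scores)
    raise ValueError(f"unknown normalization: {method}")


def local_prune(layer_scores, ratio):
    """Local pruning: keep the top (1 - ratio) fraction of groups in every layer."""
    kept = []
    for scores in layer_scores:
        k = max(1, round(len(scores) * (1 - ratio)))
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept.append(sorted(order[:k]))
    return kept


def global_prune(layer_scores, ratio, method="mean"):
    """Global pruning: rank all groups of all layers together after per-layer
    normalization, so the kept width can differ from layer to layer."""
    normed = [normalize(s, method) for s in layer_scores]
    flat = [(v, layer, i) for layer, s in enumerate(normed) for i, v in enumerate(s)]
    n_keep = max(1, round(len(flat) * (1 - ratio)))
    flat.sort(reverse=True)  # highest normalized importance first
    kept = [[] for _ in layer_scores]
    for _, layer, i in flat[:n_keep]:
        kept[layer].append(i)
    return [sorted(k) for k in kept]


if __name__ == "__main__":
    layers = [[4.0, 3.0, 2.0, 1.0], [0.4, 0.3, 0.2, 0.1]]
    print(local_prune(layers, 0.5))
    for method in ("none", "mean", "max"):
        print(method, global_prune(layers, 0.5, method))
```

With `method="none"`, global ranking removes every group of the layer whose raw scores happen to be small. This is one reason normalization matters when layers have different score scales. A practical implementation would also enforce a minimum width per layer.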
