Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

Yiming Zhou; Xuenjie Xie; Panfeng Li; Albrecht Kunz; Ahmad Osman; Xavier Maldague

TL;DR

This work tackles the data- and compute-heavy nature of Segment Anything Models (SAM) by introducing Depth-Aware EfficientViT-SAM, which injects monocular depth priors into a lightweight RGB backbone. Depth maps from DepthAnything are processed by a dedicated depth encoder and fused with RGB features through additive fusion, enabling improved boundary delineation with limited training data. Trained on only 11.2k images (less than 0.1% of SA-1B) over 4 epochs, the approach surpasses EfficientViT-SAM in zero-shot accuracy and delivers competitive results in box- and point-prompted settings while remaining far lighter than SAM-ViT-H. The findings demonstrate that depth priors provide strong geometric guidance, offering practical gains for real-time segmentation on resource-constrained devices.

Abstract

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.

Paper Structure (14 sections, 3 equations, 2 figures, 3 tables)

Introduction
Related Work
Efficient Segment Anything Models
RGB-D Segmentation
Method
Model Architecture
Loss Function
Experiments
Experimental Settings
Runtime Efficiency
Qualitative Results
Zero-Shot Box-Prompted Segmentation
Zero-Shot Point-Prompted Segmentation
Conclusion

Figures (2)

Figure 1: Overview of our framework. Depth maps are estimated with DepthAnything (Yang et al., 2024) and encoded alongside RGB features using identical encoder architectures. Their fused embedding is processed by the SAM head, consisting of a prompt encoder and mask decoder, to produce the final segmentation.
Figure 2: Qualitative comparison of point-prompted segmentation results. Each row shows one example, with (a) the input image, (b) the estimated depth map, (c) the segmentation produced by EfficientViT-SAM, and (d) the segmentation produced by our Depth-Aware EfficientViT-SAM.
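
The additive fusion step described in the TL;DR and in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the `project` function is a hypothetical stand-in for the RGB and depth encoders (a single linear projection per token), the toy weights are arbitrary, and replicating the depth value across three channels so both branches share the same input shape is an assumption. Only the element-wise sum of the two feature maps reflects the fusion named in the text.

```python
def project(tokens, weight):
    """Hypothetical stand-in for an encoder: one linear projection per token.

    tokens: list of N tokens, each a list of C_in floats.
    weight: C_in x C_out matrix as nested lists.
    """
    c_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(c_out)]
        for tok in tokens
    ]


def additive_fusion(rgb_feat, depth_feat):
    """Fuse two feature maps of identical shape by element-wise addition."""
    same_len = len(rgb_feat) == len(depth_feat)
    if not same_len or any(len(r) != len(d) for r, d in zip(rgb_feat, depth_feat)):
        raise ValueError("RGB and depth features must have the same shape")
    return [[r + d for r, d in zip(rt, dt)] for rt, dt in zip(rgb_feat, depth_feat)]


# Two toy "pixels" (tokens). Depth is replicated to three channels (assumption)
# so that both branches can use encoders of the same architecture.
rgb_tokens = [[1.0, 0.0, 0.5], [0.25, 0.5, 0.75]]
depth_tokens = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

# Separate, arbitrary weights for each branch (same architecture, different parameters).
w_rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_depth = [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]]

rgb_feat = project(rgb_tokens, w_rgb)
depth_feat = project(depth_tokens, w_depth)
fused = additive_fusion(rgb_feat, depth_feat)
print(fused)
```

Because the sum requires matching shapes, using the same encoder architecture for both branches (as Figure 1 states) keeps the fused embedding the same size as the RGB-only embedding, so a SAM-style prompt encoder and mask decoder can consume it without changes to their input dimensions.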
