SAM3-UNet: Simplified Adaptation of Segment Anything Model 3

Xinyu Xiong, Zihuang Wu, Lei Lu, Yufa Xia

TL;DR

SAM3-UNet addresses SAM3's coarse boundaries and context-dependent failures by retaining the SAM3 image encoder, adding adapters for parameter-efficient fine-tuning, and employing a lightweight U-Net-style decoder. The architecture compresses SAM3 outputs into hierarchical features and fuses them in a compact decoder to reduce computation while preserving performance. Empirical results show state-of-the-art performance on the MSD and PMD mirror detection benchmarks and competitive results on the DUTS salient object detection benchmark, with less than 6 GB of GPU memory used during training at a batch size of 12. This work demonstrates a practical, low-cost path for adapting foundation segmentation models to downstream tasks.

Abstract

In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at https://github.com/WZH0120/SAM3-UNet.
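
To make the "simple adapter for parameter-efficient fine-tuning" more concrete, the following is a minimal sketch of a generic bottleneck adapter in plain Python. It is not taken from the SAM3-UNet code: the class name `BottleneckAdapter`, the sizes `dim` and `bottleneck`, the GELU nonlinearity, and the zero-initialised up-projection are all assumptions chosen for illustration. The point is only that a small trainable module with a residual connection can sit beside a frozen encoder layer and add far fewer parameters than the layer itself.

```python
import math
import random


def gelu(x):
    """Exact GELU activation for a single float."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer applied to one token; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical adapter: down-project, GELU, up-project, residual add."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim, self.bottleneck = dim, bottleneck
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Assumption: zero-initialised up-projection, so the adapter
        # starts out as an identity mapping around the frozen features.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def num_params(self):
        """Trainable parameters: two weight matrices plus two bias vectors."""
        return 2 * self.dim * self.bottleneck + self.dim + self.bottleneck

    def __call__(self, token):
        hidden = [gelu(h) for h in linear(token, self.w_down, self.b_down)]
        delta = linear(hidden, self.w_up, self.b_up)
        return [t + d for t, d in zip(token, delta)]  # residual connection


dim, bottleneck = 16, 4  # illustrative sizes only, not the paper's settings
adapter = BottleneckAdapter(dim, bottleneck)
token = [0.1 * i for i in range(dim)]
output = adapter(token)
frozen_dense_params = dim * dim + dim  # one frozen dim-by-dim layer, for comparison
print(adapter.num_params(), frozen_dense_params)
```

In a setup like this, only the adapter weights would be updated during fine-tuning while the encoder stays frozen, which is one common way to keep training memory low; whether SAM3-UNet uses exactly this layout is not stated in the abstract.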

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the proposed SAM3-UNet. For simplicity, we show only the decoder block where feature fusion is available.
  • Figure 2: Visualization results on mirror detection.
  • Figure 3: Visualization results on salient object detection.