Deep Instruction Tuning for Segment Anything Model

Xiaorui Huang, Gen Luo, Chaoyang Zhu, Bo Tong, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

TL;DR

This work identifies a key limitation of the Segment Anything Model (SAM) in text-guided referring image segmentation (RIS): text prompts are fused with visual features only shallowly, in the lightweight mask decoder. It proposes two Deep Instruction Tuning (DIT) strategies, End-to-end DIT (E-DIT) and Layer-wise DIT (L-DIT), that repurpose SAM’s image encoder as a deep vision-language learner without adding a new fusion branch. Results on three RIS benchmarks show that E-DIT delivers large gains over the default SAM, while L-DIT often achieves state-of-the-art performance with less data and lower training cost. The approach offers an inexpensive path to strong text-conditioned segmentation, and the code is publicly released.

Abstract

Recently, the Segment Anything Model (SAM) has become a research hotspot in multimedia and computer vision, exhibiting powerful and versatile capabilities on various (un)conditional image segmentation tasks. Although SAM supports different types of segmentation prompts, we note that, compared to point- and box-guided segmentation, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in its default lightweight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM, one end-to-end and the other layer-wise. With minimal modifications, DITs directly transform the image encoder of SAM into a stand-alone vision-language learner, rather than building another deep fusion branch, maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive RIS benchmark datasets show that a simple end-to-end DIT can improve SAM by a large margin, while the layer-wise DIT can further boost performance to the state of the art with much less data and training expenditure. Our code is released at: https://github.com/wysnzzzz/DIT.

Paper Structure

This paper contains 26 sections, 8 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustration of the default fusion of SAM (a) and our deep instruction tuning (DIT) methods (b), i.e., the end-to-end (red) and layer-wise (blue) ones. Compared with the default shallow fusion, DITs treat the visual encoder as a deep multi-modal learner for cross-modal fusion and visual feature learning, thereby achieving full interaction between text and image for text-guided SAM.
  • Figure 2: Comparison between the default fusion scheme of SAM (with ViT-B) and our deep instruction tuning methods, i.e., the end-to-end and layer-wise ones, on RefCOCO (Yu et al., 2016). In SAM, the text prompts interact with the visual tokens only in the shallow mask decoder, which is insufficient for cross-modal fusion and text-guided segmentation. Our simple deep tuning methods improve performance by a large margin without significantly changing the architecture of SAM.
  • Figure 3: Overall pipeline of Deep Instruction Tuning (DIT). We propose (a) End-to-end DIT (E-DIT), which prepends language instructions to the visual features to achieve early fusion and enhance multi-modal reasoning capability; and (b) Layer-wise DIT (L-DIT), which projects the text tokens layer by layer so that they interact with the existing sequences in the well-learned semantic space.
  • Figure 4: Predictions of the default SAM and our end-to-end and layer-wise DITs, i.e., E-DIT and L-DIT. Both methods follow text instructions better than the default SAM, while L-DIT can accurately segment the referent even with complex text content or image background. For easier comparison, red boxes mark the incorrect segmentations.
  • Figure 5: The attention maps of the default SAM and our end-to-end and layer-wise DITs, i.e., E-DIT and L-DIT. Both methods can follow text instructions better than the default SAM and adjust the area of interest according to text instructions.
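
The two token flows described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `attention_layer` is a hypothetical parameter-free stand-in for a ViT block of SAM's image encoder, the projections are random matrices, and the choice to drop the text outputs after each L-DIT layer is an assumption made for illustration. All function names are invented here.

```python
import math
import random


def linear(tokens, weight):
    """Apply a (d_in x d_out) weight matrix, given as a list of rows, to every token."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def attention_layer(tokens):
    """Toy self-attention with a residual connection (hypothetical stand-in for a ViT block)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * key[i] for e, key in zip(exps, tokens)) for i in range(dim)]
        out.append([a + b for a, b in zip(query, mixed)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def e_dit(visual, text, proj, num_layers):
    """End-to-end flow: project the text once, prepend it, carry it through every layer."""
    seq = linear(text, proj) + visual
    for _ in range(num_layers):
        seq = attention_layer(seq)
    return seq[len(text):]  # keep only the visual tokens for the mask decoder


def l_dit(visual, text, layer_projs):
    """Layer-wise flow: re-project the text for each layer and prepend it to that layer's input."""
    for proj in layer_projs:
        seq = attention_layer(linear(text, proj) + visual)
        visual = seq[len(text):]  # assumption: text outputs are discarded after each layer
    return visual


rng = random.Random(0)
text_dim, vis_dim, num_layers = 6, 4, 3
text = [[rng.uniform(-1, 1) for _ in range(text_dim)] for _ in range(2)]
visual = [[rng.uniform(-1, 1) for _ in range(vis_dim)] for _ in range(5)]
shared_proj = random_matrix(text_dim, vis_dim, rng)
layer_projs = [random_matrix(text_dim, vis_dim, rng) for _ in range(num_layers)]

out_e = e_dit(visual, text, shared_proj, num_layers)
out_l = l_dit(visual, text, layer_projs)
print(len(out_e), len(out_e[0]))  # 5 4
print(len(out_l), len(out_l[0]))  # 5 4
```

In both flows the number and width of the visual tokens are unchanged, so the output can be passed to the existing mask decoder as before. The difference is where the text enters: once at the input for the end-to-end variant, and freshly projected at every layer for the layer-wise variant.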