RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

Yonglin Li, Jing Zhang, Xiao Teng, Long Lan, Xinwang Liu

TL;DR

RefSAM adapts SAM to referring video object segmentation (RVOS) by incorporating multi-view information from diverse modalities and from successive frames at different timestamps in an online manner. A lightweight Cross-Modal MLP projects the text embedding of the referring expression into sparse and dense embeddings, which serve as user-interactive prompts.

Abstract

The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module to generate a tracking token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively. Through comprehensive ablation studies, we demonstrate our model's practical and effective design choices. Extensive experiments conducted on Refer-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods.
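
The projection step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the helper names (`make_mlp`, `mlp_forward`, `text_to_prompts`), the layer sizes, the random initialization, the mean pooling of word features into a sentence vector, and the broadcast of that vector over a spatial grid to form the dense embeddings. The abstract only states that a lightweight MLP maps text embeddings to sparse and dense embeddings.

```python
import random

def make_mlp(dims, seed=0):
    """Build weights for an MLP with layer sizes `dims` (hypothetical init)."""
    rng = random.Random(seed)
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        w = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
        b = [0.0] * d_out
        layers.append((w, b))
    return layers

def mlp_forward(layers, x):
    """Apply the MLP to one vector: ReLU between layers, none after the last."""
    for i, (w, b) in enumerate(layers):
        x = [sum(x[k] * w[k][j] for k in range(len(x))) + b[j]
             for j in range(len(b))]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def text_to_prompts(word_feats, layers, grid_hw):
    """Map text features to prompt-shaped embeddings (illustrative only).

    sparse: one projected vector per word, shaped like point/box prompts.
    dense:  the projected sentence vector repeated at every grid cell
            (assumption: mean pooling plus broadcast).
    """
    sparse = [mlp_forward(layers, w) for w in word_feats]
    n, d = len(word_feats), len(word_feats[0])
    sentence = [sum(w[k] for w in word_feats) / n for k in range(d)]
    dense_vec = mlp_forward(layers, sentence)
    h, w = grid_hw
    dense = [[list(dense_vec) for _ in range(w)] for _ in range(h)]
    return sparse, dense

# Toy usage: 5 words with 8-dim text features, projected to 4-dim prompts.
rng = random.Random(1)
word_feats = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
layers = make_mlp([8, 16, 4])
sparse, dense = text_to_prompts(word_feats, layers, grid_hw=(2, 3))
print(len(sparse), len(sparse[0]), len(dense), len(dense[0]), len(dense[0][0]))
```

In this sketch the sparse embeddings keep one vector per word, while the dense embeddings share the prompt dimension with a spatial layout, so both could be passed where a SAM-style mask decoder expects prompt embeddings. The fine-grained dense embeddings produced by the hierarchical dense attention module are not modeled here.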

Paper Structure (32 sections, 10 equations, 7 figures, 6 tables)

This paper contains 32 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: RefSAM integrates multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Similar to SAM, RefSAM generates three types of token outputs: IoU token output (green), main mask token output (light blue), and three-scale mask token outputs (dark blue). We feed the main mask token output into RefSAM's tracking module to obtain the track token that can provide historical information for the mask decoder and assist RefSAM in predicting the mask for the next frame.
  • Figure 2: The overall pipeline of RefSAM. It mainly consists of five key components: 1) Backbone: Visual Encoder of SAM (Kirillov et al., 2023) with Adapter and Text Encoder; 2) Cross-Modal MLP; 3) Hierarchical Dense Attention; 4) Mask Decoder of SAM; and 5) Implicit Tracking Module. We construct cross-modal Sparse Embeddings and Dense Embeddings to learn text-visual information and predict masks. We use the implicit tracking module to generate a track token and provide historical information for the mask decoder.
  • Figure 3: The structure of HDA. 1) The left part denotes the overall architecture of HDA. The inputs include the Intermediate Embeddings, Visual Embeddings, and Sparse Embeddings. The output is fine-grained Dense Embeddings. 2) The right part denotes the structure of Dense Attention for fusing hierarchical visual semantic information and sparse embeddings.
  • Figure 4: The influence of different learning rates for the learnable modules of RefSAM.
  • Figure 5: The influence of the number of hidden layers in Cross-Modal MLP.
  • ...and 2 more figures
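
The online, frame-by-frame flow described for Figure 1 can be sketched as a simple loop. This is a control-flow illustration only: `run_online`, `decode`, and `update_track` are hypothetical names, and the toy stand-ins below return labels rather than real masks or tokens. The one property taken from the captions is that the main mask token output of one frame is passed through the tracking module to produce a track token for the next frame.

```python
def run_online(frames, decode, update_track, init_track=None):
    """Process frames in order, carrying a track token between frames.

    decode(frame, track) -> (mask, main_mask_token)   # hypothetical decoder
    update_track(main_mask_token) -> track token      # hypothetical tracker
    """
    track = init_track  # no history is available for the first frame
    masks = []
    for frame in frames:
        mask, main_token = decode(frame, track)
        masks.append(mask)
        # The track token for the next frame comes from this frame's output.
        track = update_track(main_token)
    return masks

# Toy stand-ins that only record which history each frame saw.
def toy_decode(frame, track):
    return (frame, track), "tok:" + frame

def toy_update(main_token):
    return "trk:" + main_token

masks = run_online(["f0", "f1", "f2"], toy_decode, toy_update)
print(masks)
```

Each frame is handled as it arrives and only a single token is carried forward, which is what makes the procedure online in this sketch: no future frames are needed to predict the current mask.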