SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

Jia Lin, Xiaofei Zhou, Jiyuan Liu, Runmin Cong, Guodao Zhang, Zhi Liu, Jiyong Zhang

TL;DR

For RGB-D video salient object detection (RGB-D VSOD), the paper identifies three barriers to applying the Segment Anything Model directly: reliance on prompts, high memory cost from sequential adapters, and expensive memory attention. It introduces SAM-DAQ, which couples a depth-guided parallel adapter-based multi-modal image encoder (PAMIE) with a query-driven temporal memory (QTM) to enable prompt-free, depth-aware video segmentation while keeping training memory low. The approach is reported to achieve state-of-the-art results on three RGB-D VSOD benchmarks across all evaluation metrics, which the authors attribute to effective RGB-D fusion and temporal modeling. By pairing a vision foundation model with memory-efficient adapters and learnable queries, the work offers a scalable route to deploying foundation models for RGB-D video understanding.
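The parallel-adapter idea in the summary above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the class `DepthGuidedParallelAdapter`, the stand-in `frozen_block`, the bottleneck layout, and the zero initialization are all assumptions chosen for illustration. The point is only the dataflow: the adapter branch runs alongside the frozen block and is added back through a skip connection, instead of being inserted in sequence inside the block.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class DepthGuidedParallelAdapter:
    """Hypothetical bottleneck adapter that mixes depth tokens into RGB tokens."""

    def __init__(self, dim, bottleneck):
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))   # RGB down-projection
        self.w_depth = rng.normal(0.0, 0.02, (dim, bottleneck))  # depth down-projection
        self.w_up = np.zeros((bottleneck, dim))  # zero init: the branch starts as a no-op

    def __call__(self, rgb_tokens, depth_tokens):
        hidden = gelu(rgb_tokens @ self.w_down + depth_tokens @ self.w_depth)
        return hidden @ self.w_up


def frozen_block(tokens, weight):
    # Stand-in for one frozen encoder block (assumption; not SAM2's real block).
    return tokens + np.tanh(tokens @ weight)


def encoder_stage(rgb_tokens, depth_tokens, weight, adapter):
    # Parallel layout: frozen path and adapter path both read the block input,
    # and the adapter output is added through a skip connection.
    return frozen_block(rgb_tokens, weight) + adapter(rgb_tokens, depth_tokens)


# Toy shapes: 2 frames, 16 tokens, 32 channels, bottleneck width 8.
rgb = rng.normal(size=(2, 16, 32))
depth = rng.normal(size=(2, 16, 32))
block_weight = rng.normal(0.0, 0.02, (32, 32))
adapter = DepthGuidedParallelAdapter(dim=32, bottleneck=8)
out = encoder_stage(rgb, depth, block_weight, adapter)
```

With the zero-initialized up-projection, the stage initially reproduces the frozen block exactly, so training would start from the pretrained behavior; only the small adapter matrices would need to be updated.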

Abstract

Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to apply this foundation model directly to the RGB-D video salient object detection (RGB-D VSOD) task, which typically raises three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, the Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by integrating depth and temporal cues within a unified framework. First, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) via skip connections. Notably, we adapt the frozen SAM encoder under prompt-free conditions, with the DPAs using depth cues to facilitate the fusion of multi-modal features. Second, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging frame-level and video-level queries simultaneously, the QTM module not only selectively extracts temporal consistency features but also iteratively updates the temporal representations of the queries. Extensive experiments on three RGB-D VSOD datasets show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods on all evaluation metrics.
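To make the query-driven temporal memory more concrete, here is a rough NumPy sketch of one way frame-level and video-level queries could interact. Everything in it is an assumption for illustration: the function names `cross_attend` and `run_query_memory`, the single-head attention without learned projections, and the moving-average update with a `momentum` parameter are not taken from the paper. It shows only the general pattern of a fixed-size set of queries reading each frame and being updated iteratively, in place of a growing memory bank.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, tokens):
    # Single-head attention, no learned projections (simplifying assumption).
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ tokens


def run_query_memory(frames, frame_queries, video_queries, momentum=0.9):
    """frames: (T, num_tokens, dim); both query sets: (num_queries, dim)."""
    per_frame_states = []
    for tokens in frames:
        # Frame-level queries are conditioned on the current video-level state
        # and then read the features of the current frame.
        q = frame_queries + video_queries
        frame_state = q + cross_attend(q, tokens)
        # Video-level queries are updated iteratively (hypothetical moving average).
        video_queries = momentum * video_queries + (1.0 - momentum) * frame_state
        per_frame_states.append(frame_state)
    return np.stack(per_frame_states), video_queries


rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 64, 32))      # 5 frames, 64 tokens, 32 channels
frame_queries = rng.normal(size=(8, 32))   # 8 queries (toy value)
video_queries = np.zeros((8, 32))
states, final_video_queries = run_query_memory(frames, frame_queries, video_queries)
```

In this sketch the temporal state stays at `(num_queries, dim)` regardless of how many frames are processed, which is the property that makes a query-based memory cheaper than attending over stored features from all past frames.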

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: High-level illustration of our SAM-DAQ.
  • Figure 2: The overall architecture of the proposed Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ) for a single frame.
  • Figure 3: Qualitative comparison with state-of-the-art RGB-D video salient object detection models on the RDVS dataset.
  • Figure 4: Ablation studies of different query numbers.