RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu

TL;DR

This work tackles RGB-D video object segmentation by mitigating cross-modal fusion gaps and long-term drift through an enhanced multi-store feature memory. It introduces Hierarchical Modality Selection and Fusion (HMSF) to adaptively fuse RGB and depth features, and a segmentation refinement module that leverages the Segment Anything Model (SAM) with spatio-temporal and modality embeddings to produce reliable masks for memory guidance. A memory-management scheme, inspired by Atkinson-Shiffrin memory and powered by HMSF, encodes RGB-D images and segmentation results to sustain robust segmentation across frames. On ARKitTrack, the method delivers state-of-the-art performance, driven by memory-guided fusion and SAM-based refinement, demonstrating strong potential for robust RGB-D VOS in real-world applications.
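The multi-store idea can be illustrated with a toy three-store buffer in the spirit of the Atkinson-Shiffrin model: a sensory store holding only the latest frame, a small working store of recent frames, and a bounded long-term store fed by the working store. This is a minimal sketch under stated assumptions, not the paper's memory-management scheme. The class name, the capacities, and the first-in-first-out consolidation rule are all hypothetical, and features and masks are stand-in values rather than HMSF-encoded tensors.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MultiStoreMemory:
    """Toy three-store memory (hypothetical names and rules, for illustration only)."""

    working_capacity: int = 3        # assumed size of the working store
    long_term_capacity: int = 4      # assumed size of the long-term store
    sensory: Optional[tuple] = None  # holds only the most recent entry
    working: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)

    def update(self, frame_id, feature, mask):
        """Store the (feature, mask) pair produced for one frame."""
        entry = (frame_id, feature, mask)
        self.sensory = entry
        self.working.append(entry)
        # When the working store overflows, move its oldest entry to long-term memory.
        if len(self.working) > self.working_capacity:
            self.long_term.append(self.working.popleft())
            # Keep the long-term store bounded by dropping its oldest entry.
            if len(self.long_term) > self.long_term_capacity:
                self.long_term.pop(0)

    def read(self):
        """Return all stored entries, oldest first, to match against a query frame."""
        return self.long_term + list(self.working)


def demo_frame_ids(num_frames=10):
    """Feed placeholder entries for num_frames frames and list the frame ids kept."""
    memory = MultiStoreMemory()
    for t in range(num_frames):
        memory.update(t, feature=f"feat{t}", mask=f"mask{t}")
    return [entry[0] for entry in memory.read()]
```

With the assumed capacities, the buffer never holds more than seven entries no matter how long the video is. That bounded footprint is the design motivation for separating short-lived and consolidated stores; a real system would likely consolidate selectively rather than by age.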

Abstract

RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric cues of the depth modality to improve segmentation performance. However, off-the-shelf RGB-D segmentation methods fail to fully exploit cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. Specifically, we design hierarchical modality selection and fusion, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that uses the Segment Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results as memory to guide subsequent segmentation. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
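To make "adaptively combines features from both modalities" concrete, the sketch below fuses an RGB feature vector and a depth feature vector with softmax weights derived from per-modality reliability scores. It is a generic weighted-fusion illustration under stated assumptions, not the paper's hierarchical modality selection and fusion design. The function names and the scalar scores are hypothetical; in a learned model the scores would be predicted by the network rather than passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(rgb_feat, depth_feat, rgb_score, depth_score):
    """Blend two equally sized feature vectors.

    rgb_score and depth_score are hypothetical reliability scores;
    a higher score gives that modality a larger share of the result.
    """
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("feature vectors must have the same length")
    w_rgb, w_depth = softmax([rgb_score, depth_score])
    fused = [w_rgb * r + w_depth * d for r, d in zip(rgb_feat, depth_feat)]
    return fused, (w_rgb, w_depth)
```

Equal scores give a plain average of the two modalities, while a much higher score for one modality makes the fused vector follow it almost exactly. That is the behavior one would want when, for example, depth is noisy or RGB is poorly lit.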

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of the frameworks of different RGB-D VOS methods. (a) RGB-D VOS methods without memory, which use a template to guide fusion and segmentation. (b) The proposed method, which uses memory to guide fusion and segmentation.
  • Figure 2: The framework of the proposed method, which consists of RGB-D fusion and mask generation, segmentation refinement, and multi-store memory management.
  • Figure 3: The details of hierarchical modality selection and fusion. (a) Hierarchical modality selection and fusion used in the RGB-D fusion and mask generation module. (b) Hierarchical modality selection and fusion for memory, used in the multi-store memory management module.
  • Figure 4: The details of modality selection and fusion.
  • Figure 5: The details of segmentation refinement. (a) The details of spatio-temporal embedding, which generates mixed prompts. (b) The details of modality embedding, which generates fused images.
  • ...and 1 more figure