X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

Pinxue Guo, Wanyun Li, Hao Huang, Lingyi Hong, Xinyu Zhou, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Wei Zhang, Wenqiang Zhang

TL;DR

X-Prompt introduces a universal RGB+X framework for multi-modal video object segmentation by treating the auxiliary modality as a visual prompt to a pretrained RGB VOS foundation model. It comprises the Multi-modal Visual Prompter (MVP), which generates cross-modal prompts, and Multi-modal Adaptation Experts (MAEs), which inject modality-specific knowledge via low-rank adapters while keeping the foundation model frozen. Trained first on RGB data and then adapted to RGB-T, RGB-D, and RGB-E with limited data, X-Prompt achieves state-of-the-art results on 3 tasks across 4 benchmarks. By avoiding per-task fusion branches and full-parameter fine-tuning, the approach reduces task-specific design, hardware, and data costs, and the authors release code for reproducibility.

Abstract

Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse given the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality as a prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which allows prompting the foundation model with various modalities to segment objects precisely. We further propose the Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising its generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Code: https://github.com/PinxueGuo/X-Prompt.git
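
The TL;DR describes the MAEs as low-rank adapters attached to a frozen foundation model, one pluggable expert per modality. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. All names (`LowRankExpert`, `AdaptedLinear`, `add_expert`) are hypothetical, the layer is a single toy linear map, and the zero initialization of `B` follows the common LoRA convention, which the source does not confirm for X-Prompt.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LowRankExpert:
    """Hypothetical low-rank adapter: delta(x) = B (A x), with rank << d."""

    def __init__(self, d_in, d_out, rank, rng):
        # A gets a small random init; B starts at zero (assumed, LoRA-style),
        # so a freshly added expert leaves the frozen layer's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class AdaptedLinear:
    """A frozen linear layer with one pluggable expert per modality."""

    def __init__(self, weight):
        self.weight = weight  # frozen: never modified after construction
        self.experts = {}

    def add_expert(self, modality, rank, rng):
        d_out, d_in = len(self.weight), len(self.weight[0])
        self.experts[modality] = LowRankExpert(d_in, d_out, rank, rng)

    def forward(self, x, modality=None):
        y = matvec(self.weight, x)
        if modality is None:
            return y  # plain RGB path: frozen weights only
        delta = self.experts[modality](x)
        return [a + b for a, b in zip(y, delta)]


rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
layer = AdaptedLinear(W)
for name in ("thermal", "depth", "event"):
    layer.add_expert(name, rank=2, rng=rng)

x = [1.0] * 8
base = layer.forward(x)
adapted = layer.forward(x, "thermal")
```

In this sketch, only `A` and `B` of the selected expert would be updated during adaptation, so each modality adds `rank * (d_in + d_out)` parameters while the shared weights stay untouched. Selecting a different expert changes the task without affecting the others.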

Paper Structure (17 sections, 13 equations, 4 figures, 7 tables)

This paper contains 17 sections, 13 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a), (b), and (c): current RGB-T, RGB-D, and RGB-E video object segmentation paradigms. (d): X-Prompt, a unified framework for RGB+X multi-modal video object segmentation tasks.
  • Figure 2: The overall architecture of the universal X-Prompt framework for RGB-X multi-modal video object segmentation tasks. Following the pre-training of an RGB VOS foundation model (Sec. \ref{sec:preliminary}) with robust segmentation capabilities and generalization, X-Prompt (Sec. \ref{sec:framework}) utilizes the X-modality to prompt and adapt the foundation model for various downstream multi-modal tasks, employing our proposed Multi-modal Visual Prompter (Sec. \ref{sec:prompter}) and Multi-modal Adaptation Experts (Sec. \ref{sec:expert}).
  • Figure 3: The design of the Multi-modal Visual Prompter (MVP), which encodes the spatial-modal attended complementary prompt embedding for the foundation model and the multi-scale multi-modal prompt embedding for the mask decoder.
  • Figure 4: Qualitative results of RGB-X. X-Prompt effectively utilizes X-modality to address challenging scenarios.