Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well

Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li

TL;DR

Camouflaged Object Segmentation (COS) is hard due to subtle foreground-background differences, and single-modality RGB guidance is often insufficient. The proposed MultiCOS framework combines BFSer, which fuses modalities in latent and state spaces with a fusion-feedback loop, and CKLer, which learns cross-modal knowledge from external multimodal datasets to generate pseudo-modal inputs and guidance, enabling segmentation improvements even when real multimodal COS data are scarce. Across RGB-I, RGB-D, and RGB-P benchmarks, MultiCOS achieves state-of-the-art results and demonstrates robustness to misalignment and data scarcity, while remaining plug-and-play for existing COS models. These advances offer a practical path to exploit diverse modalities in real-world COS applications, with potential extensions to more modalities and coordinated multi-task learning on translation and refinement networks.

Abstract

Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds and single-modality RGB methods provide limited cues, leading researchers to explore multimodal data to improve segmentation accuracy. In this work, we present MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: Bi-space Fusion Segmentor (BFSer), which employs state-space and latent-space fusion mechanisms to integrate cross-modal features within a shared representation and a fusion-feedback mechanism to refine context-specific features, and Cross-modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo-modal inputs and establish cross-modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non-COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo-modal data. Code will be released at https://github.com/cnyvfang/MultiCOS.

Paper Structure

This paper contains 27 sections, 25 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Training across different data scenarios: real data, generated data, and our MultiCOS. UR-RGB and UR-Modal denote task-unrelated RGB images and their corresponding multimodal data.
  • Figure 2: Framework of our MultiCOS, and the details of FFM, LSFM, $g_w$, and SSFM. Modules outlined by dashed lines are those introduced by CKLer.
  • Figure 3: Details of our proposed CSSM.
  • Figure 3: Quantitative comparisons of PCOD.
  • Figure 4: Qualitative results of MultiCOS$\dagger$ on RGB-I and other cutting-edge methods.
  • ...and 5 more figures