Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li
TL;DR
Camouflaged Object Segmentation (COS) is hard due to subtle foreground-background differences, and single-modality RGB guidance is often insufficient. The proposed MultiCOS framework combines BFSer, which fuses modalities in latent and state spaces with a fusion-feedback loop, and CKLer, which learns cross-modal knowledge from external multimodal datasets to generate pseudo-modal inputs and guidance, enabling segmentation improvements even when real multimodal COS data are scarce. Across RGB-I, RGB-D, and RGB-P benchmarks, MultiCOS achieves state-of-the-art results and demonstrates robustness to misalignment and data scarcity, while remaining plug-and-play for existing COS models. These advances offer a practical path to exploit diverse modalities in real-world COS applications, with potential extensions to more modalities and coordinated multi-task learning on translation and refinement networks.
Abstract
Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds, and single-modality RGB methods provide limited cues. This has led researchers to explore multimodal data to improve segmentation accuracy. In this work, we present MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: the Bi-space Fusion Segmentor (BFSer), which employs state-space and latent-space fusion mechanisms to integrate cross-modal features within a shared representation and a fusion-feedback mechanism to refine context-specific features; and the Cross-modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo-modal inputs and establish cross-modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non-COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo-modal data. Code will be released at https://github.com/cnyvfang/MultiCOS.
