Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao, Han Yang, Haozhe Xing, Lihe Zhang, Boxue Hou

TL;DR

This work addresses referring camouflaged object detection (Ref-COD) by introducing RFMNet, a two-branch model that fuses multi-context reference features with camouflage features. A novel overlapped windows cross-attention mechanism focuses on local region matching for image references, while a text-guided Referring Object Enhancement module leverages textual cues; a Referring Feature Aggregation (RFA) module decodes results progressively. The approach achieves state-of-the-art performance on the R2C7K Ref-COD dataset, outperforming prior methods in both quantitative metrics and qualitative segmentation quality. The study demonstrates that rich multi-context reference information, when fused at multiple feature levels and decoded progressively, significantly improves camouflaged object localization and segmentation, with implications for complex multimodal perception tasks.

Abstract

Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. To this end, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Because the features in salient object images contain abundant object-related detail, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. We therefore propose an Overlapped Windows Cross-attention mechanism that lets the model focus on local information matching based on the reference features. In addition, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.

Paper Structure

This paper contains 30 sections, 19 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison of previous work with our method. (a) Previous work fuses the low-dimensional feature from the reference branch with multi-layer feature maps encoded from the camouflaged image. (b) We integrate the multi-context information from both reference features and camouflage map features.
  • Figure 2: The overall architecture of our RFMNet. It is best viewed in color. In the feature extraction stage (shown in green), we use the encoder to extract the camouflaged image features and the reference features. In the fusion stage, we use referring information fusion (RIF) modules to integrate the camouflaged features and reference features in multi-context alignment. We propose the overlapped windows cross-attention mechanism for the reference image fusion method (RIF-s). For the reference text fusion method (RIF-t), we propose a text semantics-guided referring object enhancement module. After the fusion stage, the fused features are fed into the referring feature aggregation (RFA) modules to generate the segmentation results progressively.
  • Figure 3: The reference image fusion method: overlapped windows cross-attention mechanism. Note that $\bigoplus$ is pixel-wise addition.
  • Figure 4: The text semantics-guided referring object enhancement module. Note that ‘$Conv$’ represents the $1 \times 1$ convolution block, ‘$cat$’ is the concatenation operation, and $\bigotimes$ is matrix multiplication.
  • Figure 5: The referring feature aggregation module. Note that ‘$cat$’ represents concatenation, ‘$Conv3$’ is the $3\times3$ convolution block, and $\bigodot$ is pixel-wise multiplication.
  • ...and 2 more figures
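
The idea behind the overlapped windows cross-attention of Figure 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the text above only states that fusion happens within local areas and that $\bigoplus$ is pixel-wise addition. Everything else here is an assumption made for illustration, including the function name `overlapped_window_cross_attention`, the `window` and `overlap` parameters, the zero padding at the borders, and the omission of learned query/key/value projections. Queries come from non-overlapping windows of the camouflage feature map, while keys and values come from a larger reference window that overlaps its neighbours.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def overlapped_window_cross_attention(cam, ref, window=4, overlap=2):
    """Hypothetical sketch of window-local cross-attention.

    cam, ref: (H, W, C) feature maps of identical shape (assumption).
    Queries:  non-overlapping `window` x `window` patches of `cam`.
    Keys/values: the matching patch of `ref`, enlarged by `overlap`
    pixels on every side, so neighbouring key windows overlap.
    No learned projections are used; a real model would add them.
    """
    cam = np.asarray(cam, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    H, W, C = cam.shape
    assert ref.shape == cam.shape, "sketch assumes equal shapes"
    assert H % window == 0 and W % window == 0, "H, W must divide by window"

    # Zero-pad the reference so border windows can also be enlarged.
    # Padded positions act as all-zero keys/values (a simplification).
    ref_pad = np.pad(ref, ((overlap, overlap), (overlap, overlap), (0, 0)))
    big = window + 2 * overlap
    out = np.empty_like(cam)

    for i in range(0, H, window):
        for j in range(0, W, window):
            q = cam[i:i + window, j:j + window].reshape(-1, C)   # (w*w, C)
            kv = ref_pad[i:i + big, j:j + big].reshape(-1, C)    # (big*big, C)
            attn = softmax(q @ kv.T / np.sqrt(C), axis=-1)       # local matching
            fused = attn @ kv                                    # reference info
            # Residual connection: the pixel-wise addition of Figure 3.
            out[i:i + window, j:j + window] = (q + fused).reshape(window, window, C)
    return out


rng = np.random.default_rng(0)
cam_feat = rng.standard_normal((8, 8, 16))
ref_feat = rng.standard_normal((8, 8, 16))
fused_feat = overlapped_window_cross_attention(cam_feat, ref_feat)
print(fused_feat.shape)
```

Enlarging only the key/value window keeps the cost local while letting each query patch see reference content just across its window border, which is one plausible reading of "overlapped"; the paper's exact windowing scheme may differ.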