
Camouflage-aware Image-Text Retrieval via Expert Collaboration

Yao Jiang, Zhongkuan Mao, Xuan Wu, Keren Fu, Qijun Zhao

Abstract

Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, robust image-text cross-modal alignment remains under-explored in this field, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results on CamoIT reveal the underlying challenges that CA-ITR poses for existing cutting-edge retrieval techniques, which stem mainly from objects' camouflage properties as well as complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall accuracy boost on CA-ITR, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.


Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Qualitative results of three state-of-the-art (SOTA) retrieval methods (i.e., CLIP [radford2021clip], AVSE [liu2025AVSE], and D2S-VSE [liu2025D2SVSE]) on CA-ITR. Camouflaged objects in the images are marked with red bounding boxes. Below the dotted line are samples from the general ITR datasets (i.e., MS-COCO [lin2014coco] and Flickr30K [young2014flickr]).
  • Figure 2: Data annotation process, an example, and statistical analyses of CamoIT.
  • Figure 3: The overall pipeline of the proposed CECNet and C²GA.
  • Figure 4: Qualitative results of CECNet (top, light purple) and CLIP (bottom, light orange) on sentence retrieval (left) and image retrieval (right). For each query, we present the top three relevant cross-modal instances. To enhance readability, we retain only the essential components of the sentence. The accurate, inaccurate, and ambiguous portions of the sentences in the search results are highlighted in green, red, and gray, respectively. Camouflaged objects are marked with red bounding boxes.