
Language-Guided Structure-Aware Network for Camouflaged Object Detection

Min Zhang

Abstract

Camouflaged Object Detection (COD) aims to segment objects that blend into the background in color, texture, and structure, making it one of the most challenging tasks in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate these challenges, they generally lack the guidance of textual semantic priors, which limits their ability to focus on camouflaged regions in complex scenes. To address this, we propose a Language-Guided Structure-Aware Network (LGSAN). Specifically, on top of the PVT-v2 visual backbone, we introduce CLIP to generate masks from text prompts and RGB images, guiding the multi-scale features extracted by PVT-v2 toward potential target regions. On this foundation, we design a Fourier Edge Enhancement Module (FEEM) that fuses multi-scale features with high-frequency information in the frequency domain to produce edge-enhanced features. We further propose a Structure-Aware Attention Module (SAAM) to strengthen the model's perception of object structures and boundaries. Finally, a Coarse-Guided Local Refinement Module (CGLRM) improves fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments show that our method achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.
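The FEEM described above relies on separating high-frequency content (edges, fine texture) from a feature map in the frequency domain. As a minimal illustrative sketch of that idea, not the paper's actual module, the snippet below applies a 2-D FFT, suppresses a low-frequency block around the spectrum centre, and inverts the transform; the function name and `cutoff_ratio` parameter are assumptions for illustration.

```python
import numpy as np

def high_frequency_component(feat, cutoff_ratio=0.1):
    """Extract the high-frequency part of a 2-D feature map via the FFT.

    A low-frequency rectangle of half-sides (H * cutoff_ratio, W * cutoff_ratio)
    around the shifted spectrum centre is zeroed out, so only edges and fine
    texture survive the inverse transform.
    """
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))   # move the DC term to the centre
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(h * cutoff_ratio)), max(1, int(w * cutoff_ratio))
    spectrum[cy - ry:cy + ry, cx - rx:cx + rx] = 0  # suppress low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```

On a constant map the result is (numerically) zero, since all energy sits in the suppressed DC term, while a sharp step edge produces a strong response.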


Figures (6)

  • Figure 1: The architecture of LGSAN. The overall framework of the model consists of five key components: the PVT-v2 backbone, the CLIP backbone, the FEEM, the SAAM, and the CGLRM. Refer to Section 3 for details.
  • Figure 2: The architecture of the FEEM. The FEEM generates edge enhancement features through multi-scale fusion, edge enhancement, and high-frequency modeling in the frequency domain.
  • Figure 3: The SAAM introduces semantic information of camouflaged objects into a lightweight attention framework to highlight camouflaged regions, while incorporating edge enhancement features to emphasize boundary information, thereby enabling the model to focus on structural and boundary details of camouflaged objects at high resolution.
  • Figure 4: The CGLRM employs channel and spatial attention to obtain global guidance, performs local refinement through 2×2 spatial partitioning, thereby ensuring structural consistency and boundary integrity.
  • Figure 5: The heatmaps of LGSAN from $O_4$ to $O_1$ illustrate the progressive refinement process, while the heatmaps of $M_1$ and $O_e$ are also presented for comparison.
  • ...and 1 more figure
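The CGLRM caption above mentions local refinement through a 2×2 spatial partitioning of a globally guided coarse map. As a hedged sketch of that partition-and-stitch pattern only (the actual refinement operator in the paper is learned; `refine_fn` here is a hypothetical stand-in), the helper below splits a map into four quadrants, refines each independently, and reassembles them, which preserves structural consistency at quadrant boundaries by construction.

```python
import numpy as np

def refine_in_quadrants(coarse, refine_fn):
    """Apply a local refinement function to each 2x2 partition of a coarse
    prediction map, then stitch the refined pieces back together."""
    h, w = coarse.shape
    out = np.empty_like(coarse)
    for ys in (slice(0, h // 2), slice(h // 2, h)):      # top / bottom halves
        for xs in (slice(0, w // 2), slice(w // 2, w)):  # left / right halves
            out[ys, xs] = refine_fn(coarse[ys, xs])      # refine one quadrant
    return out
```

With the identity as `refine_fn` the map is returned unchanged, so any real refinement acts purely within its local window.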