EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection

Shuo Jiang, Gaojia Zhang, Min Tan, Yufei Yin, Gang Pan

Abstract

Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion refines labels through teacher-student interaction and uses depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement optimizes local detail by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
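
The abstract credits depthwise separable convolution with efficient denoising. The sketch below is a generic, pure-Python illustration of that operator and of its parameter savings, not the paper's implementation. The function names, the nested-list tensor layout, and the stride-1, no-padding setting are all assumptions made for illustration.

```python
def conv_param_counts(kernel_size, c_in, c_out):
    """Return (standard, depthwise_separable) weight counts, ignoring biases."""
    standard = kernel_size * kernel_size * c_in * c_out
    # Depthwise: one k x k filter per input channel; pointwise: a 1x1 conv.
    separable = kernel_size * kernel_size * c_in + c_in * c_out
    return standard, separable


def depthwise_separable_conv(x, depthwise_kernels, pointwise_weights):
    """Apply a depthwise then a pointwise convolution (stride 1, no padding).

    x:                 [c_in][h][w] input feature map
    depthwise_kernels: [c_in][k][k] one spatial filter per input channel
    pointwise_weights: [c_out][c_in] weights of the 1x1 channel-mixing conv
    """
    c_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    k = len(depthwise_kernels[0])
    out_h, out_w = h - k + 1, w - k + 1

    # Depthwise step: each channel is filtered independently of the others.
    depthwise = [
        [
            [
                sum(
                    x[c][i + di][j + dj] * depthwise_kernels[c][di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for c in range(c_in)
    ]

    # Pointwise step: mix channels at every spatial position.
    return [
        [
            [
                sum(weights[c] * depthwise[c][i][j] for c in range(c_in))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for weights in pointwise_weights
    ]


# A 3x3 layer mapping 64 to 128 channels: 73728 weights versus 8768.
print(conv_param_counts(3, 64, 128))
```

For this example layer the separable form needs roughly one eighth of the weights, which is the usual motivation for using it in a lightweight refinement branch.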

Paper Structure (21 sections, 15 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: UCOD paradigm comparison. Traditional pseudo-label correction (a) suffers from boundary overflow due to the lack of native image cues, while feature learning methods (b) tend to generate blurred details due to the absence of pseudo-labels. Our method (c) combines pseudo-label guidance with multi-cue perception, yielding sharper boundaries and richer details.
  • Figure 2: The proposed EReCu adopts a DINO-based teacher–student architecture. The MNP module captures texture cues from the input to refine pseudo-labels and maintain accurate object boundaries. The EPL module enables students to learn robust semantic representations by leveraging teacher deep features in shallow layers. The STAF module collects multi-layer attention maps to create low-noise masks. Lastly, LPG generates local pseudo-labels from high-confidence areas of TAS-selected maps, refining boundary fidelity.
  • Figure 3: Overview of the proposed MNP. It extracts native perceptual cues by combining low-level texture features (LBP, DoG) with mid-level semantics (frozen ResNet-18), and uses random sampling to ensure robust multi-cue similarity estimation.
  • Figure 4: Overview of the EPL module. It enables interaction between shallow student features $F_s^i$ and deep teacher features $F_t^{i+k}$ via DSC, and progressively refines pseudo-masks $M_s^{\mathrm{dsc}}$ and $M_t^{\mathrm{p}}$ using a hierarchical loss combining Dice and perceptual terms.
  • Figure 5: Visualization of MHSA reveals that different heads focus on distinct visual cues. Comparing individual heads, their average, and an attention-selected aggregation against the original image shows that the proposed attention selection exhibits low attention entropy and conforms to intrinsic image characteristics, thereby reducing noise interference while preserving details.
  • ...and 1 more figures
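
The Figure 3 caption names two low-level texture descriptors, LBP (local binary patterns) and DoG (difference of Gaussians). The following is a minimal pure-Python sketch of the textbook forms of both, assuming a grayscale image stored as a list of rows. The function names, the 8-neighbour clockwise bit order, the sigma values, and the edge-replicating border handling are illustrative assumptions; the paper's exact settings, its random sampling step, and the ResNet-18 semantic branch are not reproduced here.

```python
import math

# Offsets of the 8 neighbours, clockwise starting from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def local_binary_pattern(image):
    """Return 8-bit LBP codes for every interior pixel of a grayscale image."""
    h, w = len(image), len(image[0])
    codes = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            center = image[i][j]
            code = 0
            for bit, (di, dj) in enumerate(NEIGHBOURS):
                # Set the bit when the neighbour is at least as bright as the center.
                if image[i + di][j + dj] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(t * t) / (2 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [wt / total for wt in weights]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur; borders are handled by replicating edge pixels."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(value, upper):
        return min(max(value, 0), upper)

    # Horizontal pass over each row.
    rows = [
        [
            sum(kernel[t + radius] * image[i][clamp(j + t, w - 1)] for t in range(-radius, radius + 1))
            for j in range(w)
        ]
        for i in range(h)
    ]
    # Vertical pass over each column of the horizontally blurred image.
    return [
        [
            sum(kernel[t + radius] * rows[clamp(i + t, h - 1)][j] for t in range(-radius, radius + 1))
            for j in range(w)
        ]
        for i in range(h)
    ]


def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Subtract a wide blur from a narrow one to obtain a band-pass edge response."""
    narrow = gaussian_blur(image, sigma_small)
    wide = gaussian_blur(image, sigma_large)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(narrow, wide)]
```

On a flat region the DoG response is zero, and it changes sign across an intensity edge, which is why such cues are useful for checking whether a pseudo-label boundary follows real image structure.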