ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection

Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, Huchuan Lu

TL;DR

ZoomNeXt introduces a unified image-video camouflaged object detector that mimics human zooming behavior by processing scale-space representations. It combines a shared Triplet Feature Encoder, a Scale Merging Subnetwork with Multi-Head Scale Integration Units, and a Hierarchical Difference Propagation Decoder with Rich Granularity Perception Units, enhanced by a difference-aware routing mechanism for temporal cues. A novel Uncertainty Awareness Loss biases predictions toward high confidence in candidate regions, enabling more robust COD under challenging textures and occlusions. Empirically, ZoomNeXt achieves state-of-the-art results on both image and video COD benchmarks, with comprehensive ablations validating the contributions of each component and the unification of pipelines across tasks.

Abstract

Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, i.e., zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and be adaptively deactivated and output all-zero results for static representations. They provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks. Our code can be found at https://github.com/lartpang/ZoomNeXt.
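
To make the idea of an uncertainty-penalizing regularizer concrete, here is a minimal sketch. The abstract does not give the formula, so the form used here, 1 - |2p - 1|^k with k = 2, is an assumption chosen only because it is largest at p = 0.5 (most ambiguous) and zero at p = 0 or p = 1 (fully confident). The names `ual` and `mean_ual` are hypothetical and are not taken from the released code.

```python
def ual(p, power=2):
    """Assumed uncertainty penalty for one predicted probability p in [0, 1].

    Returns 1 - |2p - 1|**power: maximal (1.0) at p = 0.5 and zero at p = 0 or 1.
    This specific form is an illustrative assumption, not the paper's stated loss.
    """
    return 1.0 - abs(2.0 * p - 1.0) ** power


def mean_ual(probs, power=2):
    """Average the per-pixel penalty over a flat list of probabilities."""
    return sum(ual(p, power) for p in probs) / len(probs)


if __name__ == "__main__":
    # An ambiguous pixel is penalized; confident pixels are not.
    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, ual(p))
```

Minimizing such a term alongside binary cross entropy would push predictions away from 0.5; how the two terms are weighted in practice is not specified in this section.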

Paper Structure

This paper contains 25 sections, 2 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of the proposed framework with existing paradigms. Existing methods feed image and video inputs into task-specific architectures to yield the camouflaged object map. In contrast, ZoomNeXt, which is built on the flexible difference-aware routing mechanism, unifies and simplifies the processing pipeline of image and video COD tasks, sharing powerful static components without introducing redundant computation.
  • Figure 2: Overall framework of the proposed ZoomNeXt. The shared triplet feature encoder is used to extract multi-level features corresponding to different input "zoom" scales. At different levels of the scale merging subnetwork, MHSIUs are adopted to screen and aggregate the critical cues from different scales. Then the fused features are gradually integrated through the top-down up-sampling path in the hierarchical difference propagation decoder. RGPUs further enhance the feature discrimination by constructing a multi-path structure inside the features. Finally, the probability map of the camouflaged object corresponding to the input image or frame can be obtained. In the training stage, the binary cross entropy and the proposed uncertainty awareness loss are used as the loss function.
  • Figure 3: Illustration of the multi-head scale integration unit. $\otimes$ is the element-wise multiplication. $\phi$ and $\gamma$ are the parameters of the two separated group-wise transformation layers. More details can be found in the section on the scale merging subnetwork.
  • Figure 4: Rich granularity perception unit where $\otimes$, $\oplus$, and $\ominus$ are the element-wise multiplication, addition, and subtraction. Group-wise interaction and channel-wise modulation are used to explore the discriminative and valuable semantics from different channels. Each feature group is executed sequentially and the latter one integrates part of the features of the previous one before the feature transformation. The temporal shifting operation shifts the frame feature maps along the temporal dimension and some temporal convolutional layers diffuse the temporal cues, as stated in the section on the hierarchical difference propagation decoder.
  • Figure 5: Curves of different forms of the proposed UAL.
  • ...and 7 more figures
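
The Figure 4 caption mentions a temporal shifting operation and element-wise subtraction, and the abstract states that the routing path outputs all zeros for static inputs. The sketch below illustrates why a shift-then-subtract path can behave that way. It assumes a circular shift by one frame over plain Python lists; the actual unit may pad instead of wrapping, shift only part of the channels, and apply temporal convolutions afterwards. The names `temporal_shift` and `frame_difference` are hypothetical.

```python
def temporal_shift(frames):
    """Circularly shift a list of per-frame feature vectors by one step in time.

    Assumption: wrap-around shifting; the real operation may differ.
    """
    return frames[-1:] + frames[:-1]


def frame_difference(frames):
    """Subtract each frame's shifted neighbour from it, element by element."""
    shifted = temporal_shift(frames)
    return [
        [cur - prev for cur, prev in zip(f_cur, f_prev)]
        for f_cur, f_prev in zip(frames, shifted)
    ]


if __name__ == "__main__":
    video = [[1.0, 2.0], [3.0, 5.0]]   # two frames, two features each
    image = [[1.0, 2.0]]               # a single static frame
    print(frame_difference(video))     # non-zero inter-frame differences
    print(frame_difference(image))     # all zeros: the path is effectively off
```

With a single frame, the shifted copy equals the input, so the difference vanishes and only the static components contribute. This is consistent with the unified image-video design described above, under the stated assumptions.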