A Revisit to the Decoder for Camouflaged Object Detection

Seung Woo Ko, Joopyo Hong, Suyoung Kim, Seungjai Bang, Sungzoon Cho, Nojun Kwak, Hyung-Sin Kim, Joonseok Lee

TL;DR

This work targets camouflaged object detection (COD) by redesigning the decoder with two auxiliary components: the Enrich Decoder, which uses channel-wise attention to emphasize COD-relevant features and fuse multi-scale information, and the Retouch Decoder, which applies spatial attention to refine object boundaries after decoding. The ENTO architecture sandwiches a base decoder between these pre- and post-processing decoders, enabling high-resolution feature utilization and finer boundary delineation while remaining compatible with various encoders, including Transformers. Training supervises the coarse outputs from the Enrich Decoder as well as the final outputs from the base and Retouch decoders using pixel-weighted BCE and IoU losses, guiding stepwise refinement. Empirically, ENTO achieves state-of-the-art performance on the COD10K, CAMO, and NC4K datasets and adapts well across encoder backbones, delivering superior boundary accuracy and detail with a compact decoder footprint.
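To make the supervision scheme concrete, here is a minimal sketch of a pixel-weighted BCE plus IoU loss applied to several decoder outputs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the smoothing constant, the way per-pixel weights are supplied, and the equal weighting of the three supervised outputs are all hypothetical choices that the summary does not specify.

```python
import math


def weighted_bce_iou(probs, mask, weights, smooth=1.0, eps=1e-8):
    """Pixel-weighted BCE plus pixel-weighted soft-IoU loss for one map.

    probs, mask, weights are flat lists of equal length:
    probs in (0, 1), mask in {0, 1}, weights > 0.
    How the weights are derived (e.g. larger near boundaries) is an
    assumption left to the caller; the source only says "pixel-weighted".
    """
    triples = list(zip(probs, mask, weights))
    # Weighted binary cross-entropy, normalized by the total weight.
    bce = -sum(
        w * (m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps))
        for p, m, w in triples
    ) / sum(weights)
    # Weighted soft IoU: intersection and union computed on probabilities.
    inter = sum(w * p * m for p, m, w in triples)
    union = sum(w * (p + m) for p, m, w in triples) - inter
    iou = 1.0 - (inter + smooth) / (union + smooth)
    return bce + iou


def deeply_supervised_loss(outputs, mask, weights):
    """Sum the loss over every supervised output.

    outputs would hold, for example, the Enrich coarse map, the base
    decoder map, and the Retouch map. Equal weighting is an assumption.
    """
    return sum(weighted_bce_iou(p, mask, weights) for p in outputs)
```

A prediction that agrees with the mask should score far lower than one that contradicts it, and every supervised stage contributes its own term, which is what pushes each stage toward a usable map before the next one refines it.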

Abstract

Camouflaged object detection (COD) aims to generate a fine-grained segmentation map of camouflaged objects hidden in their background. Because these objects blend into their surroundings, the decoder must be tailored both to extract the features that distinguish them and to generate their complex boundaries with extra care. In this paper, we propose ENTO, a novel architecture that augments the prevalent decoding strategy in COD with an Enrich Decoder and a Retouch Decoder, which together help to generate a fine-grained segmentation map. Specifically, the Enrich Decoder amplifies the feature channels that are important for COD using channel-wise attention. The Retouch Decoder further refines the segmentation maps by spatially attending to important pixels, such as those in boundary regions. With extensive experiments, we demonstrate that ENTO achieves superior performance with various encoders, and that the two novel components play distinct, mutually complementary roles.
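The two attention mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. This is a simplified stand-in, not the paper's blocks: the channel gate assumes a squeeze-and-excitation style bottleneck (global average pooling, a ReLU layer, a sigmoid layer), and the spatial gate assumes a per-pixel mix of the channel-wise mean and max. The function names and all weight arguments are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w_down, w_up):
    """Rescale each channel of feat (C x H x W nested lists) by a gate.

    w_down: R x C matrix and w_up: C x R matrix (illustrative weights).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed)))
              for row in w_down]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
             for row in w_up]
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feat, gates)]


def spatial_attention(feat, w_mean=1.0, w_max=1.0, bias=0.0):
    """Rescale each pixel of feat by a gate shared across channels.

    The gate mixes the channel-wise mean and max at each location; a
    real block would typically use a convolution here (assumption).
    """
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(w_mean * sum(ch[i][j] for ch in feat) / n_ch
                     + w_max * max(ch[i][j] for ch in feat)
                     + bias)
             for j in range(width)] for i in range(height)]
    return [[[gate[i][j] * ch[i][j] for j in range(width)]
             for i in range(height)] for ch in feat]
```

The contrast between the two is the point: the channel gate decides which feature maps matter for the whole image, while the spatial gate decides which locations matter across all feature maps, which is why the latter is the natural tool for boundary refinement.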

Paper Structure

This paper contains 18 sections, 5 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Examples of camouflaged objects in COD10K and NC4K.
  • Figure 2: Performance on two representative metrics, $S_\alpha$ and $F^w_\beta$, on COD10K [fan2020sinet].
  • Figure 3: Overall Architecture of ENTO, comprising a feature encoder and the three consecutive decoders. We show the architecture with 4 levels ($L=4$), consistent with our full model, but the number of levels $L$ can be adjusted according to the feature encoder.
  • Figure 4: Channel Attention Block (CAB) [zhang2018cab] and Spatial Attention Block (SAB).
  • Figure 5: Qualitative comparison with baseline models on various types of camouflaged objects. Our model effectively captures diverse ranges of camouflages in the datasets.
  • ...and 4 more figures