CamoFormer: Masked Separable Attention for Camouflaged Object Detection

Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, Deng-Ping Fan, Luc Van Gool

TL;DR

CamoFormer tackles camouflaged object detection by explicitly modeling foreground and background cues with Masked Separable Attention (MSA) and a progressive top-down decoder. Its attention heads are partitioned into foreground, background, and global groups, and the model's own soft predictions serve as the masks, which yields precise, boundary-aware segmentation. On NC4K, COD10K, and CAMO, the method reports state-of-the-art results, with average relative gains of around 5% in S-measure and weighted F-measure, along with improved border quality. The results suggest that masked separable attention is effective for camouflaged object detection and may transfer to other binary segmentation tasks.

Abstract

Identifying and segmenting camouflaged objects from the background is challenging. Inspired by the multi-head self-attention in Transformers, we present a simple masked separable attention (MSA) for camouflaged object detection. We first separate the multi-head self-attention into three parts, which are responsible for distinguishing the camouflaged objects from the background using different mask strategies. Furthermore, we propose to capture high-resolution semantic representations progressively with a simple top-down decoder built on the proposed MSA to attain precise segmentation results. These structures plus a backbone encoder form a new model, dubbed CamoFormer. Extensive experiments show that CamoFormer surpasses all existing state-of-the-art methods on three widely used camouflaged object detection benchmarks, with average relative improvements of around 5% over previous methods in terms of S-measure and weighted F-measure.
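
The head-splitting idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes ordinary token-wise scaled dot-product attention (the paper's F-TA and B-TA branches may be formulated differently), it assumes the mask is applied multiplicatively to the queries and keys, and the names `masked_separable_attention`, `n_fg`, and `n_bg` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention; q, k, v have shape (heads, tokens, dim).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def masked_separable_attention(q, k, v, mask, n_fg, n_bg):
    """Split the heads into foreground, background, and global groups.

    mask: (tokens,) soft foreground prediction in [0, 1].
    Assumption: the mask scales queries and keys; values are left unmasked.
    """
    m_fg = mask[None, :, None]          # broadcast over heads and channels
    m_bg = 1.0 - m_fg                   # complementary background mask
    fg = slice(0, n_fg)
    bg = slice(n_fg, n_fg + n_bg)
    gl = slice(n_fg + n_bg, None)       # remaining heads stay unmasked
    out_fg = attention(q[fg] * m_fg, k[fg] * m_fg, v[fg])
    out_bg = attention(q[bg] * m_bg, k[bg] * m_bg, v[bg])
    out_gl = attention(q[gl], k[gl], v[gl])
    return np.concatenate([out_fg, out_bg, out_gl], axis=0)

# Usage with random features and a random soft mask.
rng = np.random.default_rng(0)
heads, tokens, dim = 6, 16, 8
q, k, v = (rng.standard_normal((heads, tokens, dim)) for _ in range(3))
mask = rng.random(tokens)
out = masked_separable_attention(q, k, v, mask, n_fg=2, n_bg=2)
print(out.shape)  # (6, 16, 8)
```

In a progressive decoder of the kind the abstract describes, `mask` would presumably come from the previous, coarser prediction, so each refinement step attends with a sharper foreground/background split.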

Paper Structure (17 sections, 7 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Visual comparison between our CamoFormer and recent state-of-the-art methods (e.g., SegMaR (Jia et al., CVPR 2022) and ZoomNet (Pang et al., CVPR 2022)) for camouflaged object detection. The segmentation details of different methods in the green rectangle regions are displayed with focus views. We can easily observe that our CamoFormer can generate much better results than other methods. Best viewed in color.
  • Figure 2: Overall architecture of our CamoFormer model. First, a pretrained Transformer-based backbone is utilized to extract multi-scale features of the input image. Then, the features from the last three stages are aggregated to generate the coarse prediction. Next, the progressive refinement decoder equipped with masked separable attention (MSA) is applied to gradually polish the prediction results. All the predictions generated by our CamoFormer are supervised by the ground truth (GT).
  • Figure 3: Diagrammatic details of the proposed F-TA in our MSA. Our B-TA shares a similar structure except for the mask.
  • Figure 4: Visualization comparisons between our CamoFormer and other SOTA methods. Segmentation results are shown in orange.
  • Figure 5: Comparisons of our CamoFormer and other SOTA methods on the borders of segmentation. The borders of GT are marked in white, and the ones of predictions are in orange.
  • ...and 5 more figures