A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection

Chao Hao, Zitong Yu, Xin Liu, Jun Xu, Huanjing Yue, Jingyu Yang

TL;DR

This work tackles the challenge of a universal network for both camouflaged object detection (COD) and salient object detection (SOD) by introducing SENet, a simple yet effective Vision Transformer (ViT)-based asymmetric encoder-decoder. It augments the ViT core with a Local Information Capture Module (LICM) to inject local context and a Dynamic Weighted (DW) loss to emphasize small or difficult targets, while leveraging an MAE-inspired image reconstruction task as a beneficial auxiliary objective. The authors also explore joint training of COD and SOD with two paradigms, revealing a practical yet challenging trade-off between tasks and showing that sharing an encoder with task-specific decoders can mitigate conflicts. Extensive experiments across nine benchmark datasets demonstrate state-of-the-art performance on both COD and SOD, with ablations underscoring the importance of LICM, DW loss, and MAE pretraining for achieving competitive results.

Abstract

Camouflaged object detection (COD) and salient object detection (SOD) are two distinct yet closely related computer vision tasks that have been widely studied over the past decades. Although both aim to segment an image into binary foreground and background regions, COD focuses on concealed objects hidden in the image, while SOD concentrates on the most prominent objects in the image. Previous works achieved good performance by stacking various hand-designed modules and multi-scale features. However, these carefully designed complex networks often performed well on one task but not on the other. In this work, we propose a simple yet effective network (SENet) based on the vision Transformer (ViT). By employing a simple asymmetric ViT-based encoder-decoder structure, we obtain competitive results on both tasks, exhibiting greater versatility than meticulously crafted networks. Furthermore, to enhance the Transformer's ability to model local information, which is important for pixel-level binary segmentation tasks, we propose a local information capture module (LICM). We also propose a dynamic weighted loss (DW loss) based on Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss, which guides the network to pay more attention to smaller and more difficult-to-find target objects according to their size. Moreover, we explore the issue of joint training of SOD and COD, and propose a preliminary solution to the conflict in joint training, further improving the performance of SOD. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method. The code is available at https://github.com/linuxsino/SENet.
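
To make the idea behind the DW loss concrete, the following is a minimal sketch of a size-dependent weight applied to a BCE-plus-IoU loss. The abstract states only that the loss builds on BCE and IoU and depends on object size; the specific weight used here (`2 - foreground ratio`), the function names, and the flat-list input format are assumptions made for illustration and are not the paper's formula.

```python
import math


def size_weight(target):
    """Hypothetical weight: larger when the object covers fewer pixels.

    Ranges from 1.0 (object fills the image) to 2.0 (no object pixels).
    This choice is an assumption, not the weighting defined in the paper.
    """
    return 2.0 - sum(target) / len(target)


def dw_loss_sketch(pred, target, eps=1e-7):
    """Size-weighted BCE + soft-IoU loss over flat lists of pixels.

    pred:   predicted foreground probabilities in [0, 1]
    target: ground-truth labels, each 0 or 1
    """
    n = len(target)
    # Binary cross-entropy, with probabilities clamped away from zero.
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(pred, target)
    ) / n
    # Soft IoU loss: 1 - intersection / union, computed on probabilities.
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    iou = 1.0 - (inter + eps) / (union + eps)
    # A smaller object yields a larger weight on the combined loss.
    return size_weight(target) * (bce + iou)
```

With this illustrative weighting, a mask with one foreground pixel out of four is weighted more heavily than a mask with three foreground pixels out of four, which mirrors the stated goal of emphasizing smaller targets.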

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 6 tables, and 1 algorithm.

Figures (10)

  • Figure 1: SENet (Ours) achieves the best performance on nine datasets of COD (in purple) and SOD (in brown) compared with UJSC, F3Net, and SINet. The performance on the COD datasets is evaluated by testing the model trained on COD training sets, and the same applies to SOD. The more challenging the dataset, the greater the lead achieved by SENet. For the evaluation metric used here, see Eq. (6).
  • Figure 2: Illustration of the proposed SENet with its training process. Our SENet mainly consists of two parts: an encoder and a decoder composed of two asymmetric ViTs. The proposed LICM is seamlessly integrated in parallel with both the multi-head self-attention (MHSA) layer and the multi-layer perceptron (MLP) layer within every Transformer block. Masked images are utilized as input, and supervised training on the network is conducted by employing both the loss for the image reconstruction task and the loss for the binary segmentation task. $\otimes$ indicates pixel-wise multiplication. See the Appendix for the illustration of the inference process of SENet.
  • Figure 3: Illustration of the proposed LICM. "Unpatchify" and "Patchify" here are essentially reshape operations. LICM is able to capture more localized information by converting tokens into patches and then performing a small-kernel convolution.
  • Figure 4: Illustration of two joint training paradigms. Both paradigms are based on SENet, and only the main encoder and decoder parts of SENet are retained in the illustration. (a) For Paradigm 1, two tasks share the entire network. (b) For Paradigm 2, two tasks share the encoder but have independent decoders.
  • Figure 5: Visual comparison of our camouflage predictions with the state-of-the-art (SOTA) methods on different types of COD samples. Please zoom in for more details.
  • ...and 5 more figures
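
The LICM described in the Figure 3 caption (tokens reshaped into a 2D grid, a small-kernel convolution, then a reshape back to tokens) can be sketched in plain Python as follows. This is only an illustration of the data flow: the 3x3 kernel size, the zero padding, the use of one kernel shared across all channels, and the function names are assumptions, and the sketch omits how the module is combined with the MHSA and MLP layers.

```python
def unpatchify(tokens, h, w):
    """Reshape a flat list of h*w tokens into an h x w grid of tokens."""
    assert len(tokens) == h * w, "token count must equal h * w"
    return [tokens[r * w:(r + 1) * w] for r in range(h)]


def patchify(grid):
    """Flatten an h x w grid of tokens back into a list of tokens."""
    return [token for row in grid for token in row]


def conv3x3(grid, kernel):
    """Apply one 3x3 kernel to every channel, with zero padding."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for r in range(h):
        for col in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, col + dc
                    # Neighbours outside the grid contribute zero.
                    if 0 <= rr < h and 0 <= cc < w:
                        k = kernel[dr + 1][dc + 1]
                        for ch in range(c):
                            out[r][col][ch] += k * grid[rr][cc][ch]
    return out


def licm_sketch(tokens, h, w, kernel):
    """Unpatchify -> small-kernel convolution -> patchify."""
    return patchify(conv3x3(unpatchify(tokens, h, w), kernel))
```

Because each output token mixes only its spatial neighbours in the grid, the convolution supplies the local context that global self-attention does not explicitly favour.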