Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection

Siyuan Yao, Hao Sun, Tian-Zhu Xiang, Xiao Wang, Xiaochun Cao

TL;DR

A hierarchical graph interaction network, termed HGINet, is proposed. It discovers imperceptible objects via effective graph interaction among hierarchical tokenized features.

Abstract

Camouflaged object detection (COD) aims to identify objects that blend seamlessly into their surrounding backgrounds. Because camouflaged objects are intrinsically similar to the background region, it is extremely challenging for existing approaches to distinguish them precisely. In this paper, we propose a hierarchical graph interaction network, termed HGINet, for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e., COD10K, CAMO, NC4K, and CHAMELEON, demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at https://github.com/Garyson1204/HGINet.

Paper Structure (16 sections, 17 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Comparison of the prediction results obtained by HGINet with MGL [zhai2021mutual] and FSPNet [huang2023feature]. (a) Visual comparison in different challenging scenarios; (b) The intermediate features learned by HGINet and the SOTA methods.
  • Figure 2: The overall architecture of the proposed HGINet. It mainly consists of a transformer backbone with multiple RTFA blocks, a hierarchical graph interaction transformer (HGIT), and a decoder network with confidence aggregated feature fusion (CAFF) modules. (a) illustrates our RTFA, i.e., region-aware token focusing attention, which consists of a pooling and dynamic token clustering strategy to excavate the most distinguishable tokens. (b) demonstrates our graph projection and reprojection strategy in latent space.
  • Figure 3: Details of our decoder network with CAFF modules. "Fusion" in the decoder network consists of a Conv-BN-ReLU layer and Pixel Shuffle.
  • Figure 4: Visual comparison with several representative state-of-the-art methods in challenging scenarios, including small, large, multiple, occluded objects, and confused boundaries with great uncertainty. Please zoom in for details.
  • Figure 5: Visualization results of features before and after passing the HGIT module in HGINet. Features before and after passing the graph convolution networks in FSPNet [huang2023feature] and MGL [zhai2021mutual] are also presented for comparison. Please zoom in for details.
  • ...and 2 more figures
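
The Figure 3 caption states that "Fusion" in the decoder consists of a Conv-BN-ReLU layer and Pixel Shuffle. As a rough illustration of the Pixel Shuffle step only, the sketch below rearranges channels into spatial resolution using the common sub-pixel convention. The function name `pixel_shuffle`, the upscale factor `r`, and the nested-list tensor layout are assumptions made for this example; the factor HGINet actually uses is not stated here, and this is not the authors' implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed convention (the usual sub-pixel layout):
        out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    height, width = len(x[0]), len(x[0][0])

    # Allocate the upsampled output: fewer channels, larger spatial grid.
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(c_out)]

    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Hypothetical input: four 1x2 channels become one 2x4 channel when r = 2.
x = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
y = pixel_shuffle(x, 2)
print(y)  # [[[0, 10, 1, 11], [20, 30, 21, 31]]]
```

Each group of r*r input channels fills one r x r cell of the output, so the spatial size grows by a factor of r in each dimension while the channel count shrinks by r*r. In a decoder this lets a preceding convolution produce the extra channels that are then unfolded into a higher-resolution feature map.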