
Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval

Xin Lu, Shikun Chen, Yichao Cao, Xin Zhou, Xiaobo Lu

TL;DR

This work tackles fine-grained image retrieval by introducing AGMH, a hashing framework that learns multiple attribute descriptors to capture diverse subtle differences. It replaces single-tensor attention with a set of descriptors, guided by Attention Dispersion Loss (ADL) to encourage diverse attribute focus, and Stepwise Interactive External Attention (SIEA) to mine discrete attributes without increasing test-time cost. The method maps concatenated descriptor representations to compact binary codes via a hash-learning objective that preserves pairwise similarities, and supports out-of-sample extensions efficiently. Empirical results on five fine-grained datasets demonstrate state-of-the-art performance, particularly at short hash lengths, validating both the effectiveness of descriptor grouping and the practicality of the approach for large-scale retrieval.
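
To make the retrieval step concrete, here is a minimal NumPy sketch of how concatenated descriptor features could be binarized and ranked by Hamming distance. It is an illustration under stated assumptions, not the authors' implementation: the function names, the random projection `W`, and all dimensions are hypothetical. The only fact relied on is the standard identity that, for codes in {-1, +1}^k, the Hamming distance equals (k - inner product) / 2.

```python
import numpy as np

def binarize(features):
    """Map real-valued hash-layer outputs to codes in {-1, +1} (zero maps to +1)."""
    return np.where(np.asarray(features) >= 0, 1, -1).astype(np.int8)

def hamming_distances(query_code, database_codes):
    """Hamming distance via inner product: d = (k - <q, b>) / 2 for +/-1 codes."""
    k = query_code.shape[0]
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (k - inner) // 2

def rank_database(query_code, database_codes):
    """Return database indices sorted from nearest to farthest code."""
    return np.argsort(hamming_distances(query_code, database_codes), kind="stable")

# Hypothetical setup: 3 descriptors of 4 dimensions each, projected to 8 bits.
rng = np.random.default_rng(0)
descriptors = [rng.standard_normal(4) for _ in range(3)]
concatenated = np.concatenate(descriptors)   # shape (12,)
W = rng.standard_normal((8, 12))             # stand-in for a learned hash layer
code = binarize(W @ concatenated)            # shape (8,), entries in {-1, +1}
```

Because ranking only needs integer inner products (or bit operations on packed codes), shorter codes directly reduce both storage and query cost, which is why performance at short hash lengths matters for large-scale retrieval.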

Abstract

In recent years, hashing methods have become popular in large-scale media search because of their low storage cost and strong representation capability. To describe objects that look similar overall but differ in subtle details, a growing number of studies focus on hashing-based fine-grained image retrieval. Existing hashing networks usually generate both local and global features through attention guidance on the same deep activation tensor, which limits the diversity of the feature representations. To address this limitation, we replace attention-guided features with convolutional descriptors and propose Attributes Grouping and Mining Hashing (AGMH), which groups category-specific visual attributes and embeds them in multiple descriptors to generate a comprehensive feature representation for efficient fine-grained image retrieval. Specifically, an Attention Dispersion Loss (ADL) is designed to force the descriptors to attend to different local regions and capture diverse subtle details. Moreover, we propose a Stepwise Interactive External Attention (SIEA) to mine critical attributes in each descriptor and construct correlations between fine-grained attributes and objects. The attention mechanism is dedicated to learning discrete attributes and adds no computation during hash code generation. Finally, compact binary codes are learned by preserving pairwise similarities. Experimental results demonstrate that AGMH consistently outperforms state-of-the-art methods on fine-grained benchmark datasets.
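
The abstract does not spell out the pairwise objective, so the sketch below uses one common formulation from the deep hashing literature, a negative log-likelihood over pairwise similarities, purely as an illustration. The exact loss used in AGMH may differ; the function name, the 0.5 scaling, and the label-based similarity definition are assumptions.

```python
import numpy as np

def pairwise_similarity_loss(outputs, labels):
    """Assumed pairwise likelihood loss (not necessarily the paper's objective).

    outputs: (n, k) real-valued hash-layer outputs before binarization.
    labels:  (n,) class labels; two samples count as similar if labels match.
    """
    outputs = np.asarray(outputs, dtype=np.float64)
    labels = np.asarray(labels)
    # s_ij = 1 for same-class pairs, 0 otherwise.
    sim = (labels[:, None] == labels[None, :]).astype(np.float64)
    # theta_ij = 0.5 * <u_i, u_j>; large for aligned outputs.
    theta = 0.5 * outputs @ outputs.T
    # log(1 + exp(theta)) computed in a numerically stable way.
    log_term = np.logaddexp(0.0, theta)
    # Mean negative log-likelihood over all pairs.
    return float(np.mean(log_term - sim * theta))

labels = np.array([0, 0, 1, 1])
aligned = np.array([[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]])
misaligned = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
loss_aligned = pairwise_similarity_loss(aligned, labels)
loss_misaligned = pairwise_similarity_loss(misaligned, labels)
```

In this toy case the loss is lower when same-class samples have aligned outputs and different-class samples have opposed outputs, which is the behavior a similarity-preserving objective is meant to encourage before the outputs are binarized.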

Paper Structure

This paper contains 16 sections, 17 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The common framework of existing fine-grained hashing networks, which mainly consists of fine-grained representation learning and hash code generation. It relies on attention guidance to generate diverse local and global features from the same intermediate features.
  • Figure 2: The overall framework of AGMH. In the high-level feature extraction module, an image is fed into the backbone network to obtain base feature maps. A set of convolution operations then groups the object attributes and embeds them in different descriptors. In the attributes mining module, external attention mechanisms are attached to explore crucial attributes and learn dispersed attention. Finally, the compact binary codes are generated in the last module.
  • Figure 3: The overview of Stepwise Interactive External Attention.
  • Figure 4: Visualization of aggregation maps of attentive descriptor groups, which are generated from the objects in various fine-grained datasets. The focused regions are located in different local attributes.
  • Figure 5: The mAP results on CUB200-2011 with code length varying from 16 to 64. FISH$^{-}$ indicates the FISH trained without classification losses.
  • ...and 1 more figure