HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image Retrieval

Chao He, Hongxi Wei

TL;DR

HybridHash addresses the challenge of large-scale image retrieval by producing compact binary codes that preserve semantic similarity. It fuses CNN-based local feature extraction with Transformer-based global modeling using an aggregation function to enable local self-attention, plus an interaction module to facilitate cross-block communication. The approach introduces a stage-wise HybridHash backbone and a Weighted Maximum Likelihood loss to learn discriminative hash codes. Empirical results on CIFAR-10, NUS-WIDE, and IMAGENET show state-of-the-art MAP across 16–64 bit codes, validating both the architectural design and the loss formulation. This work demonstrates the practical value of hybrid CNN–Transformer architectures for efficient, scalable image retrieval.

Abstract

Deep image hashing aims to map input images into simple binary hash codes via deep neural networks and thus enable effective large-scale image retrieval. Recently, hybrid networks that combine convolution and Transformer have achieved superior performance on various computer vision tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks in image retrieval still need to be verified. To this end, we propose a hybrid convolutional and self-attention deep hashing method known as HybridHash. Specifically, we propose a backbone network with a stage-wise architecture in which a block aggregation function is introduced to achieve the effect of local self-attention and reduce the computational complexity. An interaction module is designed to promote the communication of information between image blocks and to enhance the visual representations. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that the proposed method outperforms state-of-the-art deep hashing methods. Source code is available at https://github.com/shuaichaochao/HybridHash.
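
To make the retrieval setting concrete, the following is a minimal sketch of how binary hash codes are typically used once they have been produced. It is not taken from the HybridHash source code: the sign-based binarization, the function names, and the toy feature vectors are all assumptions for illustration. The abstract only states that images are mapped to binary codes, and Hamming-distance ranking is the usual way such codes are compared.

```python
def binarize(features):
    """Map real-valued features to a {-1, +1} code by taking the sign.

    Assumption: sign thresholding is a common convention in deep hashing;
    the abstract does not state how the hash layer binarizes its output.
    """
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_database(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Toy 4-bit example with made-up feature vectors.
query = binarize([0.9, -0.2, 0.4, -0.7])
database = [
    binarize([-0.5, 0.3, -0.1, 0.8]),  # differs from the query in all 4 bits
    binarize([0.7, -0.1, 0.2, -0.9]),  # identical code to the query
    binarize([0.6, 0.4, 0.3, -0.2]),   # differs from the query in 1 bit
]
ranking = rank_database(query, database)
print(ranking)
```

Because comparing two codes only requires counting differing bits, short codes (the paper evaluates 16 to 64 bits) keep both storage and comparison cost low for large databases.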

Paper Structure (16 sections, 10 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The detailed architecture of the proposed HybridHash. We adopt a segmentation similar to that of ViT to divide the image with finer granularity and feed the generated image patches into the Transformer Block. The whole hybrid network consists of three stages that gradually decrease the resolution and increase the channel dimension. An interaction module follows each stage to promote the communication of information between image blocks. Finally, the binary codes are output after the hash layer.
  • Figure 2: An instance demonstrating the detailed process of aggregation and disaggregation. For each feature map, we transform the original $8 \times 8$ image patches into 4 image blocks (each block is shown in a different color) via the aggregation function, and each image block contains 16 image patches. Self-attention is performed exclusively within each image block.
  • Figure 3: Illustration of the Interaction Module. The interaction module incorporates two branches: the first uses convolutional operations to achieve local communication of information across image blocks, and the second uses block tokens to accomplish global communication of information between image blocks. Finally, the two features are fused.
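
The aggregation and disaggregation described in the Figure 2 caption can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' implementation: it assumes the 4 blocks are contiguous, non-overlapping $4 \times 4$ tiles of the $8 \times 8$ patch grid, and it uses integer patch indices in place of feature vectors. The function names `aggregate` and `disaggregate` are hypothetical.

```python
def aggregate(patches, grid_size, block_size):
    """Group a row-major grid of patches into non-overlapping square blocks.

    Assumption: blocks are contiguous block_size x block_size tiles; the
    caption gives only the counts (4 blocks of 16 patches), not the layout.
    """
    if grid_size % block_size != 0:
        raise ValueError("grid_size must be divisible by block_size")
    if len(patches) != grid_size * grid_size:
        raise ValueError("expected grid_size * grid_size patches")
    blocks_per_side = grid_size // block_size
    blocks = []
    for block_row in range(blocks_per_side):
        for block_col in range(blocks_per_side):
            block = []
            for r in range(block_size):
                for c in range(block_size):
                    row = block_row * block_size + r
                    col = block_col * block_size + c
                    block.append(patches[row * grid_size + col])
            blocks.append(block)
    return blocks


def disaggregate(blocks, grid_size, block_size):
    """Invert aggregate(): scatter block contents back to the row-major grid."""
    blocks_per_side = grid_size // block_size
    patches = [None] * (grid_size * grid_size)
    for block_index, block in enumerate(blocks):
        block_row, block_col = divmod(block_index, blocks_per_side)
        for offset, patch in enumerate(block):
            r, c = divmod(offset, block_size)
            row = block_row * block_size + r
            col = block_col * block_size + c
            patches[row * grid_size + col] = patch
    return patches


# The 8 x 8 grid from Figure 2, with patch indices standing in for features.
patches = list(range(64))
blocks = aggregate(patches, grid_size=8, block_size=4)
restored = disaggregate(blocks, grid_size=8, block_size=4)

# Pairwise attention scores: all patches at once vs. within each block only.
global_pairs = len(patches) ** 2
local_pairs = sum(len(block) ** 2 for block in blocks)
print(len(blocks), len(blocks[0]), global_pairs, local_pairs)
```

Under these assumptions, restricting self-attention to each block reduces the number of pairwise scores from $64^2 = 4096$ to $4 \times 16^2 = 1024$, which illustrates the reduction in computational complexity that the abstract attributes to the block aggregation function.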