Transformer-based Clipped Contrastive Quantization Learning for Unsupervised Image Retrieval

Ayush Dubey, Shiv Ram Dubey, Satish Kumar Singh, Wei-Ta Chu

TL;DR

This work tackles unsupervised image retrieval by addressing two core issues: the reliance on CNN backbones with limited global context and the presence of false negatives in contrastive learning. It introduces TransClippedCLR, a Transformer-based framework that uses a Vision Transformer to capture global image context, product quantization to generate compact hash codes, and a clipped contrastive loss to mitigate false negatives, coupled with retrieval via asymmetric quantized similarity. The approach delivers substantial gains across CIFAR-10, NUS-WIDE, and Flickr25K compared with state-of-the-art baselines, and demonstrates that clipping and larger batch sizes further boost performance. Overall, TransClippedCLR provides a scalable, effective solution for unsupervised image retrieval, with strong generalization to different backbones and code lengths.
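The retrieval step mentioned above, asymmetric quantized similarity over product-quantized codes, can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the codebooks are hand-picked toy values rather than learned ones, hard nearest-codeword assignment stands in for whatever quantizer the paper trains, and the helper names (`split`, `encode`, `asymmetric_similarity`) are hypothetical.

```python
import math

def split(vec, num_subspaces):
    """Split a vector into equal-length contiguous sub-vectors."""
    d = len(vec) // num_subspaces
    return [vec[i * d:(i + 1) * d] for i in range(num_subspaces)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(vec, codebooks):
    """Hard-assign each sub-vector to its nearest codeword (squared L2)."""
    code = []
    for sub, book in zip(split(vec, len(codebooks)), codebooks):
        dists = [sum((s - c) ** 2 for s, c in zip(sub, word)) for word in book]
        code.append(dists.index(min(dists)))
    return code

def asymmetric_similarity(query, code, codebooks):
    """Score an unquantized query against a quantized database item.

    "Asymmetric" means only the database side is quantized; the query keeps
    its full-precision sub-vectors.
    """
    subs = split(query, len(codebooks))
    return sum(dot(sub, book[idx]) for sub, book, idx in zip(subs, codebooks, code))

# Toy codebooks (assumed values): 2 subspaces, 2 codewords each, 2-D sub-vectors.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, -1.0]],
]
# Each item is stored as one codeword index per subspace.
code_bits = len(codebooks) * int(math.log2(len(codebooks[0])))

database = [[0.9, 0.1, 0.8, 1.1], [0.2, 0.7, -1.2, -0.9]]
codes = [encode(v, codebooks) for v in database]

query = [1.0, 0.0, 1.0, 1.0]
scores = [asymmetric_similarity(query, c, codebooks) for c in codes]
ranking = sorted(range(len(codes)), key=lambda i: scores[i], reverse=True)
```

Here each database item is stored as a 2-bit code, and the query is ranked against those codes without ever being quantized itself. In practice the per-subspace dot products between the query and every codeword would be precomputed once into a lookup table, so that scoring an item reduces to summing table entries.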

Abstract

Unsupervised image retrieval aims to learn important visual characteristics without any given labels in order to retrieve images similar to a given query image. Convolutional Neural Network (CNN)-based approaches have been extensively exploited with self-supervised contrastive learning for image hashing. However, existing approaches suffer from the lack of effective utilization of global features by CNNs and from the bias created by false negative pairs in contrastive learning. In this paper, we propose the TransClippedCLR model, which encodes the global context of an image using a Transformer that retains local context through patch-based processing, generates hash codes through product quantization, and avoids potential false negative pairs through clipped contrastive learning. The proposed model achieves superior performance for unsupervised image retrieval on benchmark datasets, including CIFAR-10, NUS-WIDE and Flickr25K, compared to recent state-of-the-art deep models. The results using the proposed clipped contrastive learning are greatly improved on all datasets compared to the same backbone network with vanilla contrastive learning.
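To make the idea of avoiding false negatives concrete, the sketch below computes an InfoNCE-style contrastive loss for a single anchor and drops the most similar negatives from the denominator. The clipping rule (discard the top `clip_k` negatives by cosine similarity), the temperature value, and the function name `clipped_info_nce` are all assumptions made for illustration; the abstract does not specify the exact formulation, so this should not be read as the paper's loss.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def clipped_info_nce(anchor, positive, negatives, temperature=0.5, clip_k=1):
    """InfoNCE loss for one anchor, ignoring the clip_k most similar negatives.

    clip_k=0 recovers the vanilla contrastive loss. Dropping the top-k
    negatives is an assumed clipping rule, not necessarily the paper's.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg_sims = sorted(cosine(anchor, n) for n in negatives)
    # Keep only the least similar negatives; the clipped ones are treated
    # as likely false negatives (e.g. another image of the same object).
    kept = neg_sims[:max(0, len(neg_sims) - clip_k)]
    neg = sum(math.exp(s / temperature) for s in kept)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # augmented view of the same image
negatives = [
    [0.95, 0.05],            # near-duplicate: a likely false negative
    [0.0, 1.0],
    [-1.0, 0.0],
]

vanilla = clipped_info_nce(anchor, positive, negatives, clip_k=0)
clipped = clipped_info_nce(anchor, positive, negatives, clip_k=1)
```

With these toy vectors the clipped loss is lower than the vanilla one, because the near-duplicate negative no longer pushes the anchor away from a semantically similar sample.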

Paper Structure (14 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The augmented view serves as the positive key, while the remaining views in the batch make up the negative keys. Because the negative keys are randomly sampled, false negatives can occur and hamper training. The views of the second cat image, marked in red, are false negatives.
  • Figure 2: Schematic diagram of the proposed TransClippedCLR model.
  • Figure 3: The ViT network, consisting of patch embedding, transformer encoder blocks, and a multi-layer perceptron [vit].