Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly

TL;DR

This work tackles the challenge of large-scale retrieval with foundation-model embeddings by proposing CroVCA, a simple, unified principle for learning binary codes that remain aligned across semantically consistent views. It centers on a lightweight HashCoder network and a joint objective that combines alignment via binary cross-entropy with a coding-rate surrogate to promote code diversity, enabling both unsupervised and supervised hashing. The approach achieves state-of-the-art or competitive retrieval performance in just 5 training epochs on a single GPU, with 16-bit codes preserving meaningful semantic structure and enabling rapid deployment. Importantly, HashCoder can probe frozen encoders or be fine-tuned efficiently via LoRA, and shows strong transferability from ImageNet-1k to diverse downstream datasets, highlighting its broad applicability for fast, scalable retrieval workflows.
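Retrieval with short binary codes, such as the 16-bit codes mentioned above, is typically done by ranking database items by Hamming distance to the query code. The following is a minimal sketch of that lookup in plain Python, not the authors' implementation. The helper names (`pack_bits`, `hamming`, `hamming_search`) are hypothetical, and a brute-force scan stands in for whatever index a real system would use.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into one integer code (first bit is most significant)."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Hamming distance between two integer codes: number of differing bits."""
    return bin(a ^ b).count("1")


def hamming_search(query, database, k=3):
    """Return indices of the k database codes closest to the query.

    Brute-force scan (an assumption for illustration); ties are broken by index.
    """
    order = sorted(range(len(database)),
                   key=lambda i: (hamming(query, database[i]), i))
    return order[:k]


# Toy 4-bit example; real codes would be 16 bits or longer.
database = [pack_bits([1, 0, 1, 1]),
            pack_bits([0, 0, 0, 0]),
            pack_bits([1, 0, 1, 0])]
query = pack_bits([1, 0, 1, 1])
nearest = hamming_search(query, database, k=2)
```

Because each comparison is an XOR followed by a bit count, the cost per item does not depend on the dimensionality of the original embedding, which is the practical appeal of hashing for large collections.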

Abstract

Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
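To make the two-part objective concrete, here is a rough sketch in plain Python of one way a cross-view binary cross-entropy term and a coding-rate regularizer could be combined. Everything specific in it is an assumption rather than the paper's formulation: the symmetric use of one view's hard bits as targets for the other, the coding-rate form $\tfrac{1}{2}\log\det\!\left(I + \tfrac{d}{n\epsilon^2} Z^\top Z\right)$ applied directly to logits, the values of `eps` and `lam`, and all function names. A real implementation would use an autodiff framework and stop gradients through the targets.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_alignment(logits_a, logits_b):
    """Mean BCE between sigmoid(logits_a) and the hard bits of logits_b."""
    total, count = 0.0, 0
    for row_a, row_b in zip(logits_a, logits_b):
        for za, zb in zip(row_a, row_b):
            target = 1.0 if zb > 0 else 0.0  # binarized code of the other view
            p = min(max(sigmoid(za), 1e-7), 1.0 - 1e-7)
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
            count += 1
    return total / count


def log_det_spd(m):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                logdet += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return logdet


def coding_rate(z, eps=0.5):
    """Assumed coding-rate surrogate: 0.5 * logdet(I + d / (n * eps^2) * Z^T Z)."""
    n, d = len(z), len(z[0])
    scale = d / (n * eps ** 2)
    gram = [[(1.0 if i == j else 0.0) + scale * sum(r[i] * r[j] for r in z)
             for j in range(d)] for i in range(d)]
    return 0.5 * log_det_spd(gram)


def alignment_plus_rate_loss(logits_a, logits_b, lam=1.0):
    """Symmetric alignment term minus a weighted coding-rate term (to be minimized)."""
    align = 0.5 * (bce_alignment(logits_a, logits_b) + bce_alignment(logits_b, logits_a))
    rate = 0.5 * (coding_rate(logits_a) + coding_rate(logits_b))
    return align - lam * rate


diverse = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
collapsed = [[1.0, 1.0]] * 4
```

In this toy setup the `diverse` batch has a higher coding rate than the `collapsed` one, which illustrates why maximizing the rate discourages all inputs from mapping to the same code.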

Paper Structure

This paper contains 27 sections, 11 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Cross-view code alignment for different hashing setups: unsupervised (left), supervised (middle), and cross-modal (right). Encoders are either frozen or fine-tuned via LoRA.
  • Figure 2: HashCoder design
  • Figure 3: t-SNE of CIFAR10 embeddings: original 768-dim (left) vs. 16-dim HashCoder (right).
  • Figure 4: Nearest neighbors of a zebra using SimDINOv2 features vs. HashCoder's 16-bit codes.
  • Figure 5: ImageNet100 retrieval results for two queries. Rows: original SimDINOv2 768-dim embeddings; Hashing-Baseline (H-B); HashCoder with 16-bit codes trained via unsupervised LoRA fine-tuning.
  • ...and 4 more figures