Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval

Young Kyun Jang; Donghyun Kim; Ser-nam Lim

TL;DR

DCMQ distills the semantic richness of Vision-Language Pretraining into a compact, hashing-based cross-modal retrieval framework. It introduces Normalization with Paired Consistency (NPC) to produce discriminative target similarities from VLP embeddings and Product Quantization with Gumbel (PQG) to balance codeword usage during training. The method jointly trains image/text encoders and codebooks under VLP supervision, achieving state-of-the-art mAP on MS COCO, MIRFlickr, and NUS-WIDE across 32–128 bit codes, with substantial storage and speed advantages. This VLP-guided quantization approach enables scalable, efficient retrieval while preserving fine-grained cross-modal semantics, and it generalizes across multiple VLP teachers and backbones.
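To make the storage claim concrete, the following is a rough sketch of the arithmetic behind product-quantization code lengths. The helper names, the 512-dimensional float32 baseline, and the one-million-item database are illustrative assumptions, not figures reported by the paper.

```python
def pq_code_bits(num_codebooks: int, codewords_per_book: int) -> int:
    """Bits needed to store one item as one codeword index per codebook."""
    # (K - 1).bit_length() is the number of bits needed to index K codewords.
    bits_per_index = (codewords_per_book - 1).bit_length()
    return num_codebooks * bits_per_index


def storage_bytes(num_items: int, bits_per_item: int) -> int:
    """Total bytes for num_items codes, rounded up to a whole byte."""
    total_bits = num_items * bits_per_item
    return -(-total_bits // 8)  # ceiling division


# Assumed setting: 8 codebooks with 256 codewords each gives a 64-bit code.
code_bits = pq_code_bits(num_codebooks=8, codewords_per_book=256)

# Assumed baseline: a 512-dimensional float32 embedding (32 bits per value).
float_bits = 512 * 32

num_items = 1_000_000  # hypothetical database size
quantized = storage_bytes(num_items, code_bits)
dense = storage_bytes(num_items, float_bits)
compression_ratio = dense // quantized
```

Under these assumptions the quantized database takes 8,000,000 bytes against 2,048,000,000 bytes for the dense embeddings, a 256-fold reduction. The codebooks themselves add a small fixed cost that this sketch ignores.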

Abstract

"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely used in applications such as image-text cross-modal search. In this paper, we explore how the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models, can enhance learning to hash. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP model as a 'teacher' to distill knowledge into a 'student' hashing model equipped with codebooks. This process replaces supervised labels, which are composed of multi-hot vectors and lack semantics, with the rich semantics of VLP. We then apply a transformation termed Normalization with Paired Consistency (NPC) to obtain a discriminative target for distillation. Further, we introduce a new quantization method, Product Quantization with Gumbel (PQG), that promotes balanced codebook learning, thereby improving retrieval performance. Extensive benchmark testing demonstrates that DCMQ consistently outperforms existing supervised cross-modal hashing approaches, showcasing its significant potential.
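The abstract does not spell out how PQG is computed, so the following is only a minimal sketch of the general idea of soft product quantization with Gumbel noise. It assumes negative squared Euclidean distance as the codeword logit and a temperature `tau`; both choices, and all function names, are hypothetical and may differ from the paper's formulation.

```python
import math
import random


def gumbel_noise(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return -math.log(-math.log(u))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(sub, codebook, tau=1.0, rng=None):
    """Soft-assign one sub-vector to the codewords of one codebook.

    Assumption: logits are negative squared distances. If rng is given,
    Gumbel noise perturbs the logits (training-time behaviour).
    """
    logits = [-sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in codebook]
    if rng is not None:
        logits = [v + gumbel_noise(rng) for v in logits]
    weights = softmax([v / tau for v in logits])
    # The soft-quantized sub-vector is a weighted combination of codewords.
    soft = [sum(w * cw[d] for w, cw in zip(weights, codebook))
            for d in range(len(sub))]
    return soft, weights


def product_quantize(z, codebooks, tau=1.0, rng=None):
    """Split z into equal sub-vectors and quantize each with its own codebook."""
    sub_dim = len(z) // len(codebooks)
    soft_vec, codes = [], []
    for l, codebook in enumerate(codebooks):
        sub = z[l * sub_dim:(l + 1) * sub_dim]
        soft, weights = soft_quantize(sub, codebook, tau, rng)
        soft_vec.extend(soft)
        # Hard code: index of the codeword with the largest weight.
        codes.append(max(range(len(codebook)), key=weights.__getitem__))
    return soft_vec, codes
```

Calling `product_quantize(z, codebooks)` without `rng` gives the deterministic assignment one would use at inference, while passing `rng=random.Random(0)` adds noise so that codewords other than the current nearest one occasionally receive weight, which is the intuition behind encouraging balanced codeword usage.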

Paper Structure (28 sections, 6 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Method
Overview
VLP as Supervisory Teacher with NPC
Product Quantization with Gumbel
Finding Cross-Modal Alignment
Inference
Experiments
Settings
Datasets.
Comparing with Existing Methods
Ablation Study
The impact of knowledge distillation with VLP teachers.
The impact of NPC.
...and 13 more sections

Figures (9)

Figure 1: A comparison of the training process between (a) existing supervised deep hashing and (b) our DCMQ. Supervised labels, which only indicate category presence, lack detailed cross-modal semantic information. To address this, we convert these labels into text and employ Vision-Language Pretraining (VLP) to compute cross-modal similarity between images and labels, thereby forming the target distribution for knowledge distillation. Performance gains from this process are achieved through the introduction of two crucial techniques: Normalization with Paired Consistency (NPC) and Product Quantization with Gumbel noise (PQG).
Figure 2: An example of how NPC impacts the cosine similarity scores between image and text output embeddings of VLP. Without NPC, the scores fall within the narrow range of 0.05 to 0.19, which does not effectively distinguish between relevant (positive) and non-relevant (negative) text. With NPC, the scores are normalized to the range of -1.0 to 1.0, showing enhanced semantic relevance between the image and the texts.
Figure 3: An example similarity matrix computed with 64 VLP image and text embeddings. After NPC, scores become more dominant while keeping paired consistency.
Figure 4: (a) PQG on multi-modal data. The $l$-th sub-vectors of the soft-quantized embeddings, $\mathbf{\hat{z}}^i_{l}$ and $\mathbf{\hat{z}}^t_{l}$, are created by a combination of codewords in the $l$-th codebook $\mathbf{C}_l$. (b) Joint training of image ($\mathbf{x}^i$, $\mathbf{z}^i$) and text ($\mathbf{x}^t$, $\mathbf{z}^t$) embeddings in order to obtain cross-modal alignment between the original and quantized representations.
Figure 5: Top 1000-precision (a, b), and Recall at 1000 (c, d) curves on MIRFlickr @64-bit.
...and 4 more figures
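
The range change described in the Figure 2 caption (raw scores of 0.05 to 0.19 mapped onto -1.0 to 1.0) can be illustrated with a plain min-max rescaling. This is a stand-in and not the NPC transformation itself, whose exact formula is not given in this section; the function name and the toy similarity values are hypothetical.

```python
def rescale_to_signed_unit_range(sim):
    """Min-max rescale a similarity matrix (list of rows) to [-1, 1].

    Stand-in for illustration only: it stretches a narrow score range
    while preserving the ordering of all entries.
    """
    flat = [v for row in sim for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Degenerate case: all scores identical, nothing to separate.
        return [[0.0 for _ in row] for row in sim]
    return [[2.0 * (v - lo) / (hi - lo) - 1.0 for v in row] for row in sim]


# Toy 2x2 image-text cosine similarities; the diagonal holds the true pairs.
raw = [
    [0.19, 0.05],
    [0.07, 0.17],
]
rescaled = rescale_to_signed_unit_range(raw)
```

Because the mapping is monotone, each paired (diagonal) entry that was the largest in its row stays the largest after rescaling, while the gap between positives and negatives widens from at most 0.14 to at most 2.0.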
