Lightweight Contrastive Distilled Hashing for Online Cross-modal Retrieval

Jiaxing Li; Lin Jiang; Zeqi Ma; Kaihang Jiang; Xiaozhao Fang; Jie Wen

TL;DR

LCDH targets lightweight, real-time cross-modal retrieval by bridging offline (teacher) hashing and online (student) hashing through similarity-matrix distillation. The teacher fuses CLIP-derived cross-modal features and passes them through an attention module to produce discriminative hash codes, while a lightweight student (VGG16 for images and BoW for text) generates binary codes for online updates. By aligning the online similarity S^(t) with the offline similarity S via a logistic model and a maximum likelihood objective, LCDH distills coexistent semantic relevance into the online process. Empirical results on MIRFlickr-25K, IAPR TC-12, and NUS-WIDE show LCDH achieving state-of-the-art or competitive mAP and robust Top-N/PR performance, especially at short hash lengths, which indicates that a lightweight model can deliver practical online cross-modal retrieval.
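To make the "logistic model and maximum likelihood objective" concrete, here is a worked form that is common in pairwise deep hashing. It is a sketch under assumptions, not the confirmed LCDH loss: the code vectors b_i, b_j and the scaled inner product \Theta_{ij} are symbols introduced here for illustration, and S_{ij} is taken to be an entry of the offline similarity matrix with values in [0, 1].

```latex
% Assumed logistic model: each similarity entry is explained by the
% sigmoid of a scaled inner product between two hash codes.
\Theta_{ij} = \tfrac{1}{2}\, b_i^{\top} b_j, \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}

p(S_{ij} \mid b_i, b_j)
  = \sigma(\Theta_{ij})^{S_{ij}} \bigl(1 - \sigma(\Theta_{ij})\bigr)^{1 - S_{ij}}

% Using \log\sigma(x) = x - \log(1 + e^{x}) and
% \log(1 - \sigma(x)) = -\log(1 + e^{x}), maximizing the likelihood
% is equivalent to minimizing the negative log-likelihood:
\mathcal{L}
  = -\sum_{i,j} \log p(S_{ij} \mid b_i, b_j)
  = \sum_{i,j} \Bigl( \log\bigl(1 + e^{\Theta_{ij}}\bigr) - S_{ij}\,\Theta_{ij} \Bigr)
```

Under this assumed form, pairs that the offline similarity marks as related are pushed toward large positive inner products, and unrelated pairs toward negative ones.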

Abstract

Deep online cross-modal hashing has recently attracted considerable attention owing to its promising applications, which benefit from low storage requirements, fast retrieval, and cross-modality adaptability. However, several technical hurdles still hinder its application: 1) how to extract the coexistent semantic relevance of cross-modal data, 2) how to achieve competitive performance when handling real-time data streams, and 3) how to transfer the knowledge learned offline to online training in a lightweight manner. To address these problems, this paper proposes lightweight contrastive distilled hashing (LCDH) for cross-modal retrieval, which bridges offline and online cross-modal hashing through similarity matrix approximation in a knowledge distillation framework. Specifically, in the teacher network, LCDH first extracts cross-modal features with contrastive language-image pre-training (CLIP); after feature fusion, these features are fed into an attention module for representation enhancement. The output of the attention module is then fed into an FC layer to obtain hash codes, which aligns the sizes of the similarity matrices for online and offline training. In the student network, LCDH extracts visual and textual features with lightweight models and feeds them into an FC layer to generate binary codes. Finally, by approximating the similarity matrices, online hashing in the lightweight student network is enhanced by the supervision of coexistent semantic relevance distilled from the teacher network. Experimental results on three widely used datasets demonstrate that LCDH outperforms several state-of-the-art methods.
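The similarity-matrix approximation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `similarity_from_codes` and `distillation_nll`, the cosine-based teacher similarity, and the logistic negative log-likelihood are all hypothetical choices made for the example, and the CLIP, attention, VGG16 and BoW components are replaced by hand-written toy codes.

```python
import numpy as np

def similarity_from_codes(codes):
    """Pairwise cosine similarity of the codes, rescaled from [-1, 1] to [0, 1]."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    # Clip to absorb tiny floating-point overshoots outside [0, 1].
    return np.clip(0.5 * (normed @ normed.T + 1.0), 0.0, 1.0)

def distillation_nll(teacher_codes, student_codes):
    """Mean negative log-likelihood of the teacher similarity under a
    logistic model of the student code inner products (assumed loss form)."""
    S = similarity_from_codes(teacher_codes)         # offline (teacher) similarity
    theta = 0.5 * (student_codes @ student_codes.T)  # online (student) logits
    # np.logaddexp(0, theta) computes log(1 + exp(theta)) stably.
    return float(np.mean(np.logaddexp(0.0, theta) - S * theta))

# Toy 8-bit codes for four items: items 0 and 1 are related, as are items 2 and 3.
a = np.array([1, -1, 1, 1, -1, 1, -1, -1], dtype=float)
teacher_codes = np.stack([a, a, -a, -a])

# A student that reproduces the teacher's grouping, and one that does not.
student_aligned = np.stack([a, a, -a, -a])
student_misaligned = np.stack([a, -a, a, -a])

loss_aligned = distillation_nll(teacher_codes, student_aligned)
loss_misaligned = distillation_nll(teacher_codes, student_misaligned)
print(f"aligned: {loss_aligned:.4f}, misaligned: {loss_misaligned:.4f}")
```

In this toy setting the student whose pairwise structure matches the teacher's similarity matrix gets the lower loss, which is the signal a distillation objective of this kind would use to guide online updates.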
