DEMO: A Statistical Perspective for Efficient Image-Text Matching

Fan Zhang, Xian-Sheng Hua, Chong Chen, Xiao Luo

TL;DR

DEMO addresses efficient, unsupervised image-text matching with a distribution-aware hashing framework. It uses energy distance to quantify the divergence between latent semantic distributions inferred from multiple augmented views, which yields a robust instance-level similarity structure. The method combines Distribution-based Structure Mining with Collaborative Consistency Learning, optimizing a composite loss and using a differentiable proxy during training to produce modality-invariant binary codes. Empirically, DEMO achieves state-of-the-art MAP on MIRFlickr-25K, NUS-WIDE, and MS-COCO for code lengths from 16 to 128 bits and offers favorable inference efficiency, demonstrating scalability for large-scale cross-modal retrieval.
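
To make the energy-distance idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper `sample_views` is a hypothetical stand-in for encoding several augmented views of one image, the Gaussian noise model and the 2-D features are assumptions chosen for illustration, and the estimator shown is the standard V-statistic form of the squared energy distance.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs and y in ys."""
    total = sum(math.dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """V-statistic estimate of the squared energy distance between two sample sets:
    2 * E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def sample_views(center, num_views, noise, rng):
    """Hypothetical stand-in for the features of `num_views` augmented views of one
    image: Gaussian perturbations around a latent `center` (an assumption)."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(num_views)]


rng = random.Random(0)
views_a = sample_views([0.0, 0.0], num_views=8, noise=0.1, rng=rng)
views_b = sample_views([0.1, 0.0], num_views=8, noise=0.1, rng=rng)  # close to a
views_c = sample_views([3.0, 3.0], num_views=8, noise=0.1, rng=rng)  # far from a

d_ab = energy_distance(views_a, views_b)
d_ac = energy_distance(views_a, views_c)
print(f"a vs b: {d_ab:.4f}   a vs c: {d_ac:.4f}")
```

Because the comparison is made between sets of views rather than single points, one unlucky view near a distribution boundary has less influence on the resulting similarity than it would under a point-to-point distance.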

Abstract

Image-text matching is a long-standing problem that seeks to connect vision and language through semantic understanding. Because they can handle large-scale raw data, unsupervised hashing-based approaches have recently gained prominence. They typically construct a semantic similarity structure using the natural distance, which then guides model optimization. However, this similarity structure can be biased at the boundaries of semantic distributions, causing errors to accumulate during sequential optimization. To tackle this, we introduce a novel hashing approach termed Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching. From a statistical view, DEMO characterizes each image using multiple augmented views, which are treated as samples drawn from its intrinsic semantic distribution. We then employ a non-parametric distribution divergence to ensure a robust and precise similarity structure. In addition, we introduce collaborative consistency learning, which not only preserves the similarity structure in the Hamming space but also encourages consistency between the retrieval distributions from different directions in a self-supervised manner. Through extensive experiments on three benchmark image-text matching datasets, we demonstrate that DEMO achieves superior performance compared with many state-of-the-art methods.
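
The Hamming-space and retrieval-consistency ideas in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration rather than the paper's formulation: `tanh` is used as the differentiable relaxation of `sign`, the retrieval distributions are a temperature-scaled softmax over inner products, and the two directions (image-to-text and text-to-image) are compared with a symmetric KL divergence. The toy features are made up.

```python
import math


def relax(features, scale=1.0):
    """Differentiable surrogate for sign(): tanh keeps values in (-1, 1)."""
    return [math.tanh(scale * f) for f in features]


def binarize(features):
    """Inference-time codes in {-1, +1}."""
    return [1 if f >= 0 else -1 for f in features]


def hamming(code_a, code_b):
    """Hamming distance of two {-1, +1} codes via the identity (K - <a, b>) / 2."""
    inner = sum(a * b for a, b in zip(code_a, code_b))
    return (len(code_a) - inner) // 2


def retrieval_distribution(query, candidates, temperature=0.5):
    """Softmax over inner-product similarities between a query and all candidates."""
    sims = [sum(q * c for q, c in zip(query, cand)) / temperature
            for cand in candidates]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return kl_pq + kl_qp


# Toy pre-activation outputs of the two hashing networks (3 pairs, 4 bits each).
image_feats = [[1.2, -0.7, 0.3, -2.0], [-0.9, 1.1, -0.4, 0.8], [0.5, 0.6, -1.5, -0.2]]
text_feats = [[0.9, -1.0, 0.1, -1.4], [-1.3, 0.7, -0.2, 1.0], [0.2, 0.9, -1.1, 0.4]]

image_relaxed = [relax(f) for f in image_feats]
text_relaxed = [relax(f) for f in text_feats]

# Direction 1: image 0 queries the texts. Direction 2: text 0 queries the images.
p_i2t = retrieval_distribution(image_relaxed[0], text_relaxed)
p_t2i = retrieval_distribution(text_relaxed[0], image_relaxed)
consistency_loss = symmetric_kl(p_i2t, p_t2i)

# At inference, the relaxed outputs are replaced by hard binary codes.
image_codes = [binarize(f) for f in image_feats]
text_codes = [binarize(f) for f in text_feats]
print(consistency_loss, hamming(image_codes[0], text_codes[0]))
```

In this sketch, driving `consistency_loss` toward zero pushes a paired image and text to rank the rest of the batch in the same way, which is one plausible reading of "consistency between the retrieval distributions from different directions".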

Paper Structure

This paper contains 22 sections, 18 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison between cosine distance and energy distance. We leverage the randomness of data augmentation to estimate the latent semantic distributions, and then use the energy distance between distributions as a substitute for the cosine distance between data points.
  • Figure 2: An overview of our proposed DEMO. DEMO first calculates the energy distance between latent semantic distributions to generate an instance similarity matrix. DEMO then simultaneously optimizes the modality-specific hashing networks by preserving the similarity with guided consistency learning. In addition, retrieval distributions using both image and text queries are encouraged to be consistent to obtain modality-invariant binary codes.
  • Figure 3: The Precision-Recall curve, Precision-top N curve, and Recall-top N curve with 128 bits on MIRFlickr-25K. The first row plots image-to-text results, and the second row plots text-to-image results.
  • Figure 4: Sensitivity analysis of sampling times $M$ and threshold $\tau$ with 16 bits on MIRFlickr-25K.
  • Figure 5: The t-SNE visualization with 128 bits on MIRFlickr-25K. The image modality is colored red, and the text modality is colored green. The degree of overlap indicates how modality-invariant the hash codes are.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 1