Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

Hailang Huang; Zhijie Nie; Ziqiao Wang; Ziyu Shang

TL;DR

This work tackles two key bottlenecks in image-text retrieval: inter-modal matching gaps and intra-modal semantic gaps. It proposes Cross-modal and Uni-modal Soft-label Alignment (CUSA), which leverages uni-modal teacher models to generate soft-label supervision and introduces CSA and USA to regularize cross-modal and intra-modal alignment, respectively, while preserving existing architectures. The training objective combines the original cross-modal loss with KL-divergence-based CSA and USA terms, controlled by hyperparameters $\alpha$ and $\beta$. Empirical results across MSCOCO, Flickr30K, ECCV Caption, and uni-modal datasets show consistent improvements and state-of-the-art performance, plus notable gains in uni-modal retrieval, indicating robust, universal retrieval capabilities in practical scenarios.

Abstract

Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
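The combined objective summarized above (original cross-modal loss plus KL-divergence-based CSA and USA terms weighted by $\alpha$ and $\beta$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the KL direction (teacher to student), the row-wise softmax, the symmetric treatment of both retrieval directions, and the default weights are all assumptions made for illustration. The teacher matrices stand in for the soft labels $r(\cdot, \cdot)$ produced by the uni-modal teacher models.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def kl_rows(teacher_logits, student_logits):
    # Mean over rows of KL(softmax(teacher) || softmax(student)).
    # The KL direction is an assumption, not taken from the paper.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def cusa_style_loss(base_loss,
                    student_cross, teacher_cross,
                    student_img, teacher_img,
                    student_txt, teacher_txt,
                    alpha=0.1, beta=0.1):
    # Hypothetical helper: base_loss is the ITR model's original cross-modal loss.
    # CSA term: align cross-modal logits with soft labels, in both
    # image-to-text (rows) and text-to-image (columns) directions.
    csa = kl_rows(teacher_cross, student_cross) + kl_rows(teacher_cross.T, student_cross.T)
    # USA term: align image-image and text-text logits with soft labels.
    usa = kl_rows(teacher_img, student_img) + kl_rows(teacher_txt, student_txt)
    return base_loss + alpha * csa + beta * usa

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4  # batch of 4 image-text pairs (toy size)
    student = rng.normal(size=(n, n))
    teacher = rng.normal(size=(n, n))
    total = cusa_style_loss(1.0, student, teacher, student, teacher, student, teacher)
    print(f"total loss: {total:.4f}")
```

Because the two extra terms only consume similarity logits, a sketch like this leaves the retrieval model's architecture untouched, which is consistent with the plug-and-play claim. When the student logits already match the soft labels, both KL terms vanish and the total reduces to the original loss.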

Paper Structure (22 sections, 1 theorem, 9 equations, 5 figures, 5 tables)

Introduction
Related Work
Image-Text Retrieval
Alignment with Soft-label
Method
Preliminaries
Feature Extraction
Cross-modal Soft-label Alignment
Uni-modal Soft-label Alignment
Training Objective
Experiments
Experiment Setup
Datasets
Evaluation Metrics
Implementation Details
...and 7 more sections

Key Result

Proposition 1

Cross-modal alignment alone is not sufficient for optimal recognition of similar samples. Please refer to Appendix A for the proof: https://github.com/lerogo/aaai24_itr_cusa

Figures (5)

Figure 1: Illustration of our approach. We use soft-labels $r(\cdot, \cdot)$ generated by uni-modal teacher models as a supervisory signal to guide cross-modal alignment and uni-modal alignment for image-text retrieval models.
Figure 2: Illustration of our proposed CUSA. It involves an ITR model used for training and a non-training uni-modal teacher model that provides soft-label supervision signals. The CSA method optimizes cross-modal logits, while the USA method optimizes uni-modal logits.
Figure 3: (a) Models ignoring intra-modal alignment tend to obtain feature distributions on the hypersphere; (b) After adding the USA term, the model tends to obtain feature distributions on the hypersphere.
Figure 4: Visualization of features generated from 5000 randomly selected image-text pairs from the MSCOCO test set. (a) represents the visualization of image features, while (b) represents the visualization of text features.
Figure 5: Case study: the green texts or boxes represent the same as the ground-truth, while the red ones do not.

Theorems & Definitions (1)

Proposition 1
