Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval

Zehong Ma, Hao Chen, Wei Zeng, Limin Su, Shiliang Zhang

TL;DR

This paper tackles the problem of fine-grained text-to-image retrieval under textual ambiguity by introducing MMRef, a multi-modal reference learning framework. MMRef constructs a learnable, object-level multi-modal reference via a Global Fusion and Local Reconstruction pipeline, and then guides uni-modal representation learning through a Reference-Guided Representation Learning stage. At inference, a reference-based refinement step projects visual and textual features into a shared reference space to compute a reference-based similarity that complements the initial cross-modal alignment, yielding refined top-k results. Extensive experiments on five fine-grained retrieval datasets demonstrate state-of-the-art performance and show improved domain generalization when textual descriptions are leveraged, highlighting the practical impact of using comprehensive cross-modal references to handle noisy or incomplete annotations.

Abstract

Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-based refinement method that employs the object references to compute a reference-based similarity that refines the initial retrieval results. Extensive experiments are conducted on five fine-grained text-to-image retrieval datasets for different text-to-image retrieval tasks. The proposed method achieves superior performance over state-of-the-art methods. For instance, on the text-to-person image retrieval dataset RSTPReid, our method achieves a Rank1 accuracy of 56.2%, surpassing the recent CFine by 5.6%.
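
The reference-based refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax projection onto the references, the `temperature` and `weight` parameters, the additive combination of the two similarities, and all function names are assumptions chosen for illustration. The sketch only shows the general idea of re-ranking the initial top-k candidates with a similarity computed in a shared reference space.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def project_to_reference_space(feats, refs, temperature=0.1):
    """Describe each feature by its softmax-normalized cosine similarity to
    every reference (assumed form of the projection, not from the paper)."""
    logits = l2_normalize(feats) @ l2_normalize(refs).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def refine_top_k(text_feat, image_feats, refs, k=5, weight=0.5):
    """Re-rank the initial top-k images using a reference-based similarity.

    text_feat:   (d,)   query text feature
    image_feats: (N, d) gallery image features
    refs:        (K, d) object-level reference embeddings
    weight:      hypothetical mixing coefficient for the reference-based term
    """
    # Initial cross-modal similarity: cosine between text and each image.
    init_sim = l2_normalize(image_feats) @ l2_normalize(text_feat)
    top_k = np.argsort(-init_sim)[:k]

    # Project the query and the top-k candidates into the reference space.
    text_proj = project_to_reference_space(text_feat[None, :], refs)[0]
    image_proj = project_to_reference_space(image_feats[top_k], refs)

    # Reference-based similarity: cosine between the projected vectors.
    ref_sim = l2_normalize(image_proj) @ l2_normalize(text_proj)

    # Combine both similarities and re-order only the top-k candidates.
    refined = init_sim[top_k] + weight * ref_sim
    order = np.argsort(-refined)
    return top_k[order], refined[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.normal(size=(4, 8))      # 4 toy references, 8-dim features
    images = rng.normal(size=(10, 8))   # 10 toy gallery images
    text = rng.normal(size=8)           # one toy text query
    ranked, scores = refine_top_k(text, images, refs, k=5)
    print(ranked, scores)
```

In this sketch only the order within the initial top-k changes, so the reference-based term acts as a complement to the initial alignment rather than a replacement for it.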

Paper Structure

This paper contains 20 sections, 14 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Illustration of textual ambiguity and our motivation. Two images of the same person and their textual descriptions are illustrated, where the red ellipse shows an inaccurate annotation and the green ellipse highlights the discriminative detail that is missed in the second textual description. Our motivation is to construct a comprehensive multi-modal reference that encompasses all the details of a target object to guide learning better visual and textual representations.
  • Figure 2: (a) Overview of MMRef framework. The loss $\mathcal{L}_{\text{Align}}$ is used to align global visual features and textual features. The multi-modal references are constructed in the multi-modal reference construction (MMRC) with a global fusion (GF) module and a local reconstruction (LR) module. In reference-guided representation learning (RGRL), multi-modal references are utilized to facilitate learning better uni-modal representations. (b) The illustration of global fusion for a single reference. (c) The pipeline of local reconstruction for one reference.
  • Figure 3: Illustration of reference-based similarity in reference space. Textual or visual features are projected into a shared reference space, where modality-agnostic semantics are preserved and modality-specific noises are discarded. The reference-based similarity is utilized to refine the initial similarity.
  • Figure 4: Visualization of multi-modal reference. (a) “Raw Text” denotes the original caption. “Reference Text” is a semantic caption of the multi-modal reference, which is more complete and accurate. Textual phrases in red color are descriptions that do not appear in the raw text. (b) For each identity, “V2I” illustrates the attention of visual representation on the given image. “Ref2I” denotes the attention of reference embedding on the image. “T2I” is the text-to-image attention of raw text, and “RefText2I” is the attention of feature extracted from the reference text generated by captioning the reference embedding. Both “Ref2I” and “RefText2I” demonstrate that our reference encompasses more meaningful details of the person.
  • Figure 5: Attention visualization of references constructed using global fusion (GF) or a combination of global fusion and local reconstruction (GF+LR).
  • ...and 8 more figures