Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval

Haoyu Liu, Yaoxian Song, Xuwu Wang, Zhu Xiangru, Zhixu Li, Wei Song, Tiefeng Li

TL;DR

This work addresses the mismatch between real-world, human-like query styles and traditional verbose vision-language benchmarks by introducing Flickr30K-CFQ, a compact and fragmented query dataset with four levels of granularity, built atop Flickr30K Entities. It couples this dataset with an LLM-based query-enhancement framework comprising a Query-enhanced Module, which expands each query into a query batch, and a Multi-query Retrieval Module, which performs multi-turn voting over the batch to improve cross-modal retrieval. Empirical results show that CFQ reveals insufficiencies in existing datasets, and that the proposed method yields consistent improvements, with gains of over 0.9% on a public dataset and over 2.4% on Flickr30K-CFQ. The work delivers a more realistic benchmark for text-image retrieval and demonstrates the potential of prompt-driven, LLM-based query augmentation to enhance cross-modal alignment in practice.

Abstract

With the explosive growth of multi-modal information on the Internet, unimodal search cannot satisfy the requirements of Internet applications. Text-image retrieval research is needed to realize high-quality and efficient retrieval between different modalities. Existing text-image retrieval research is mostly based on general vision-language datasets (e.g., MS-COCO, Flickr30K), in which the query utterances are rigid and unnatural (i.e., verbose and formal). To overcome this shortcoming, we construct a new Compact and Fragmented Query challenge dataset (named Flickr30K-CFQ) to model the text-image retrieval task across multiple query contents and styles, including a compact and fine-grained entity-relation corpus. We propose a novel query-enhanced text-image retrieval method that uses LLM-based prompt engineering. Experiments show that our proposed Flickr30K-CFQ reveals the insufficiency of existing vision-language datasets in realistic text-image tasks. Our LLM-based Query-enhanced method, applied to different existing text-image retrieval models, improves query understanding performance by over 0.9% on a public dataset and by over 2.4% on our challenge set Flickr30K-CFQ. Our project is available anonymously at https://sites.google.com/view/Flickr30K-cfq.

Paper Structure (23 sections, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of text-image retrieval models. (1) Previous: The query in existing datasets is a verbose, global caption, and the retrieval models utilize the query directly. (2) Ours: Our dataset contains a corpus with four levels of granularity, and the proposed model uses LLMs to enhance the compact and fragmented query for subsequent retrieval.
  • Figure 2: The construction of Flickr30K-CFQ. Our dataset provides a query corpus with four levels of granularity: (1) Imagery Tag (abstract), annotated by a multi-modal LLM; (2) Phrase, inherited from Flickr30K Entities; (3) Triple (entity and relation), extracted from the corpus; (4) Fragment (multiple triples), generated by a fine-tuned T5 from multiple SPO (subject-predicate-object) triples.
  • Figure 3: The LLM-based Query-enhanced method, which consists of two modules. The first is the Query-enhanced Module, which expands the initial query into a query batch. The second is the Multi-query Retrieval Module, in which a two-stage similarity is calculated to obtain better retrieval results.
  • Figure 4: Visualization of retrieval results from LLM-based Query-enhanced method.
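
The two modules described in Figure 3 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the names `expand_query`, `multi_query_retrieve`, and `toy_embed` are hypothetical, fixed templates stand in for the LLM-generated rewrites, a bag-of-words vector stands in for a real text-image encoder, and the voting rule (one vote per query for its top-ranked images, ties broken by summed similarity) is only one plausible reading of "multi-turn voting".

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query, templates):
    """Stand-in for the Query-enhanced Module: turn one query into a query batch.

    A real system would ask an LLM for rewrites; fixed templates (an
    assumption) keep this sketch deterministic and offline.
    """
    return [query] + [t.format(q=query) for t in templates]


def multi_query_retrieve(query_batch, embed_text, image_vectors, top_k=1):
    """Stand-in for the Multi-query Retrieval Module.

    Stage 1: every query in the batch scores every image by cosine similarity.
    Stage 2: every query votes for its top_k images; images are ranked by
    vote count, with summed similarity as the tie-breaker (assumed rule).
    """
    votes = {img: 0 for img in image_vectors}
    score_sum = {img: 0.0 for img in image_vectors}
    for q in query_batch:
        q_vec = embed_text(q)
        sims = {img: cosine(q_vec, vec) for img, vec in image_vectors.items()}
        for img, s in sims.items():
            score_sum[img] += s
        for img in sorted(sims, key=sims.get, reverse=True)[:top_k]:
            votes[img] += 1
    return sorted(image_vectors, key=lambda i: (votes[i], score_sum[i]), reverse=True)


# Toy encoder: bag-of-words counts over a tiny vocabulary (hypothetical).
VOCAB = ["dog", "ball", "park", "man", "guitar"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


# Image "embeddings" are faked from short descriptions for the demo.
images = {
    "img_dog": toy_embed("dog ball park"),
    "img_man": toy_embed("man guitar"),
    "img_park": toy_embed("man park"),
}

batch = expand_query("dog ball", ["a photo of {q}", "{q} in a park"])
ranking = multi_query_retrieve(batch, toy_embed, images, top_k=1)
print(batch)
print(ranking)
```

In this toy run the compact query "dog ball" is expanded into three queries, and all three vote for `img_dog`; the third rewrite mentions a park, which is what lifts `img_park` above `img_man` in the tie-break. In practice the encoder would be a pretrained text-image retrieval model and the rewrites would come from an LLM prompt.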