CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

Zijun Long; Xuri Ge; Richard McCreadie; Joemon Jose

TL;DR

CFIR tackles large-scale long-text to image retrieval with a two-stage coarse-to-fine pipeline: an entity-based coarse filter followed by summary-based fine re-ranking. It introduces Decoupling-BEiT-3 to enable vector-based similarity and relies on a shared image embedding cache and a precomputed entity index to drastically reduce online computation. On the AToMiC LLIR dataset, CFIR improves Recall@1000 by up to 11.06% and cuts training and retrieval times by approximately 68.75% and 99.79%, respectively. The approach scales well and is practical for real-world large-scale LLIR tasks, enabling fast, accurate retrieval over massive corpora of long documents and images.
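
To put the reported time reductions in perspective, they can be converted into speedup factors. This is a small worked calculation, assuming the percentages are relative reductions in wall-clock time against the baseline; since the source reports them as approximate, the factors are approximate too. The symbol r (fractional reduction) is introduced here for illustration only.

```latex
\[
\text{speedup} = \frac{1}{1 - r}
\qquad\Rightarrow\qquad
\text{training: } \frac{1}{1 - 0.6875} = \frac{1}{0.3125} = 3.2\times,
\qquad
\text{retrieval: } \frac{1}{1 - 0.9979} = \frac{1}{0.0021} \approx 476\times
\]
```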

Abstract

Text-to-image retrieval aims to find relevant images for a text query and is important in use cases such as digital libraries, e-commerce, and multimedia databases. Although Multimodal Large Language Models (MLLMs) demonstrate state-of-the-art performance, they struggle with the large-scale, diverse, and ambiguous retrieval needs of real-world applications because of their computation cost and the injective embeddings they produce. This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework designed for fast and effective large-scale long-text to image retrieval. The first stage, Entity-based Ranking (ER), adapts to long-text query ambiguity by employing a multiple-queries-to-multiple-targets paradigm, which filters candidates for the next stage. The second stage, Summary-based Re-ranking (SR), refines these rankings using summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder, which is optimized for ambiguous user needs, serves both stages, and improves computational efficiency through vector-based similarity inference. Evaluation on the AToMiC dataset shows that CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000 while reducing training and retrieval times by 68.75% and 99.79%, respectively. We will release our code to facilitate future research at https://github.com/longkukuhi/CFIR.
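
The two-stage flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`entity_ranking`, `summary_rerank`), the max-pooling over entity queries, and the random vectors standing in for Decoupling-BEiT-3 embeddings are all assumptions made for illustration. The only point it tries to show is that, once image embeddings are cached offline, both stages reduce to vector similarity against the same shared cache.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def entity_ranking(entity_vecs, image_cache, k):
    """Stage 1 (coarse): score all cached images against several entity queries.

    Assumption: an image's coarse score is its best match over the entities
    (max pooling); the paper may aggregate entity scores differently.
    """
    sims = l2_normalize(entity_vecs) @ l2_normalize(image_cache).T  # (n_entities, n_images)
    coarse = sims.max(axis=0)
    k = min(k, coarse.shape[0])
    top = np.argpartition(-coarse, k - 1)[:k]   # unordered top-k candidates
    return top[np.argsort(-coarse[top])]        # ordered by coarse score

def summary_rerank(summary_vec, image_cache, candidates):
    """Stage 2 (fine): re-rank only the candidates using the summary embedding."""
    scores = l2_normalize(image_cache[candidates]) @ l2_normalize(summary_vec)
    order = np.argsort(-scores)
    return candidates[order], scores[order]

# Toy data: random vectors stand in for encoder outputs (hypothetical values).
rng = np.random.default_rng(0)
image_cache = rng.normal(size=(1000, 64))       # computed once, shared by both stages
entity_vecs = image_cache[[3, 42]] + 0.01 * rng.normal(size=(2, 64))
summary_vec = image_cache[42] + 0.01 * rng.normal(size=64)

candidates = entity_ranking(entity_vecs, image_cache, k=100)
ranked, scores = summary_rerank(summary_vec, image_cache, candidates)
print(ranked[:5])
```

In this toy setup the fine stage touches only 100 of the 1000 cached images, which is the kind of saving a coarse filter is meant to provide; at real corpus sizes the exhaustive dot product in stage 1 would typically be replaced by an index lookup.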

Paper Structure (19 sections, 6 figures, 4 tables)

Introduction
Related Work
Small-scale and Caption-based Text-to-Image Retrieval
Large-scale Text-to-Image Retrieval datasets
Challenges in MLLM-based approaches
Methodology
The Proposed Decoupling-BEiT-3
Entity-based Ranking (ER)
Summary-Based Re-ranking (SR)
Experimental Setup
Experiments
What is the Benefit and Cost of Freezing the Image Encoder? (RQ1)
How does CFIR perform compared to state-of-the-art approaches? (RQ2)
CFIR scalability in AToMiC Large Setting (RQ3)
Analysis
...and 4 more sections

Figures (6)

Figure 1: Left: average number of text tokens in MSCOCO short sentences, AToMiC long documents, and their summaries. Right: number of training and testing images in MSCOCO and AToMiC.
Figure 2: The overall architecture of the proposed CFIR for large-scale document-to-image retrieval.
Figure 3: CFIR Training Procedure
Figure 4: CFIR Retrieval (testing) Procedure
Figure 5: Differences between the original BEiT-3 architecture and our Decoupling-BEiT-3.
...and 1 more figure
