FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis

Mikel Williams-Lekuona, Georgina Cosma

TL;DR

This paper introduces the FiCo-ITR library, which standardises evaluation methodologies for Fine-Grained (FG) and Coarse-Grained (CG) image-text retrieval models. This enables direct comparisons and offers new insights into the performance-efficiency trade-offs between recent representative FG and CG models, highlighting their respective strengths and limitations.

Abstract

In the field of Image-Text Retrieval (ITR), recent advancements have leveraged large-scale Vision-Language Pretraining (VLP) for Fine-Grained (FG) instance-level retrieval, achieving high accuracy at the cost of increased computational complexity. For Coarse-Grained (CG) category-level retrieval, prominent approaches employ Cross-Modal Hashing (CMH) to prioritise efficiency, albeit at the cost of retrieval performance. Due to differences in methodologies, FG and CG models are rarely compared directly within evaluations in the literature, resulting in a lack of empirical data quantifying the retrieval performance-efficiency trade-offs between the two. This paper addresses this gap by introducing the FiCo-ITR library, which standardises evaluation methodologies for both FG and CG models, facilitating direct comparisons. We conduct empirical evaluations of representative models from both subfields, analysing precision, recall, and computational complexity across varying data scales. Our findings offer new insights into the performance-efficiency trade-offs between recent representative FG and CG models, highlighting their respective strengths and limitations. These findings provide the foundation necessary to make more informed decisions regarding model selection for specific retrieval tasks and highlight avenues for future research into hybrid systems that leverage the strengths of both FG and CG approaches.

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison of Fine-Grained (FG) and Coarse-Grained (CG) Image-Text Retrieval approaches. FG search uses continuous embeddings, aiming to find the retrieval sample that specifically corresponds to the query sample. Under evaluation conditions, this involves finding the retrieval sample with the same ID as the query sample. CG search employs bitwise hash codes to find retrieval samples that are more broadly relevant to the query instead of exact matches. During evaluation, this involves finding any retrieval sample that shares at least one category label with the query. The broader search criteria of CG search allow for lower computational costs, as seen in the comparison of encoding time and embedding storage costs of two representative FG and CG models (IMRAM [chen2020imram] and UCCH [hu2022unsupervised]).
  • Figure 2: An extendable framework of the FiCo-ITR library and toolkit. The pipeline consists of five main components: 1) Data Pre-Processing, which standardises dataset handling and offers optional label generation via Query2Label (Q2L) [liu2021query2label] for unlabeled datasets; 2) Model Encoding, supporting embeddings in the form of binary hash codes and continuous embeddings, as well as direct similarity matrices; 3) Similarity Measures, implementing four distance measures for uniform similarity calculation; 4) Retrieval Tasks, implementing instance- and category-level retrieval for both (i $\rightarrow$ t) and (t $\rightarrow$ i); and 5) Evaluation Metrics, reporting Recall@K for instance-level retrieval and mAP@K with P/R curves for category-level retrieval.
  • Figure 3: MS-COCO Image $\rightarrow$ Text
  • Figure 4: MS-COCO Text $\rightarrow$ Image
  • Figure 5: Flickr30K Image $\rightarrow$ Text
  • ...and 3 more figures
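
The two evaluation protocols described in the captions for Figures 1 and 2 can be illustrated with a small, self-contained Python sketch. It is not the FiCo-ITR API: every function name, the toy gallery, and the choice of cosine similarity and Hamming distance as the two measures are assumptions made for illustration. The mAP@K normalisation used here (dividing by the number of relevant items found in the top K) is one common convention in the hashing literature and may differ from the one the library uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two continuous embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hamming_distance(code_a, code_b):
    """Number of differing bits between two hash codes stored as ints."""
    return bin(code_a ^ code_b).count("1")

def recall_at_k(query_ids, ranked_ids_per_query, k):
    """Instance-level (FG) metric: share of queries whose own ID
    appears among the top-k retrieved IDs."""
    hits = sum(1 for qid, ranked in zip(query_ids, ranked_ids_per_query)
               if qid in ranked[:k])
    return hits / len(query_ids)

def average_precision_at_k(query_labels, ranked_labels, k):
    """Category-level (CG) metric for one query: a retrieved item counts as
    relevant if it shares at least one label with the query.
    Assumed convention: normalise by the relevant items found in the top k."""
    hits, precision_sum = 0, 0.0
    for rank, labels in enumerate(ranked_labels[:k], start=1):
        if query_labels & labels:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical toy gallery: continuous embeddings, hash codes, and labels.
embeddings = {"img0": [1.0, 0.0], "img1": [0.0, 1.0], "img2": [0.7, 0.7]}
hash_codes = {"img0": 0b1101, "img1": 0b1001, "img2": 0b1110}
labels = {"img0": {"dog"}, "img1": {"car"}, "img2": {"dog", "park"}}

# FG search: rank by descending cosine similarity, score by exact ID match.
query_embedding, query_id = [0.9, 0.1], "img0"
ranked_fg = sorted(embeddings,
                   key=lambda g: cosine_similarity(query_embedding, embeddings[g]),
                   reverse=True)
fg_recall_at_1 = recall_at_k([query_id], [ranked_fg], k=1)

# CG search: rank by ascending Hamming distance, score by shared labels.
query_code, query_labels = 0b1101, {"dog"}
ranked_cg = sorted(hash_codes,
                   key=lambda g: hamming_distance(query_code, hash_codes[g]))
cg_ap_at_3 = average_precision_at_k(query_labels,
                                    [labels[g] for g in ranked_cg], k=3)

print(ranked_fg, fg_recall_at_1)
print(ranked_cg, cg_ap_at_3)
```

In this toy setup the FG query counts as a hit only if `img0` itself is retrieved, whereas the CG query is credited for both `img0` and `img2` because each carries the `dog` label. Averaging `average_precision_at_k` over all queries would give mAP@K.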