Maybe you are looking for CroQS: Cross-modal Query Suggestion for Text-to-Image Retrieval

Giacomo Pacini, Fabio Carrara, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi

TL;DR

This work formalizes cross-modal query suggestion: a task that refines an initial natural-language query $q_0$ based on the visual content of the retrieved images. It also introduces the CroQS benchmark to standardize evaluation. CroQS partitions the result set $R(q_0, \mathcal{I})$ into semantic clusters $\{C_i\}$ and requires a refined query $\hat{q}_i$ for each cluster, enabling objective comparison across methods. To generate $\hat{q}_i$, the authors adapt baselines from two families: Prototype Captioning, which captions a cluster prototype in CLIP space, and GroupCap, a group-captioning approach that uses LLMs. They also evaluate a query-aware variant of ClipCap and different LLM backbones. Across metrics for cluster specificity, representativeness, and similarity to $q_0$, captioning-based methods generally yield more cluster-specific descriptions, while LLM-based methods offer more balanced performance and higher similarity to the original query; human annotations remain the upper bound. The CroQS benchmark and its accompanying baselines provide a foundation for advancing cross-modal query suggestion in text-to-image retrieval systems, with room for further improvement toward human parity.
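To make the idea of a "cluster prototype in CLIP space" concrete, here is a minimal sketch. It assumes the prototype is the normalized mean of the L2-normalized image embeddings of a cluster; the summary above does not give the exact definition, so treat this choice, the function names, and the toy 2-D vectors (stand-ins for real CLIP embeddings) as illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; a tiny floor avoids division by zero."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def cluster_prototype(embeddings: np.ndarray) -> np.ndarray:
    """Assumed prototype: unit-length mean of the normalized embeddings."""
    mean = l2_normalize(embeddings).mean(axis=0)
    return l2_normalize(mean[None, :])[0]


def closest_to_prototype(embeddings: np.ndarray) -> int:
    """Index of the cluster member with the highest cosine similarity to the prototype."""
    proto = cluster_prototype(embeddings)
    sims = l2_normalize(embeddings) @ proto
    return int(np.argmax(sims))


# Toy cluster of three "image embeddings" (hypothetical values).
cluster = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
prototype = cluster_prototype(cluster)
nearest = closest_to_prototype(cluster)
```

In a captioning-based baseline, a vector like `prototype` would be passed to a caption decoder in place of a single image embedding; that decoding step is outside the scope of this sketch.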

Abstract

Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of "Maybe you are looking for". To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods, and the notebooks containing our experiments are available here: https://paciosoft.com/CroQS-benchmark/

Paper Structure

This paper contains 24 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Cross-modal Query Suggestion. Given an initial query $q_0$ and an image collection $\mathcal{I}$, a cross-modal query suggestion system $\mathfrak{F}$ returns a set of query suggestions $\mathcal{Q}$ based on the visual content of the result set $R(q_0, \mathcal{I})$. Ideally, each suggestion $\hat{q}_i \in \mathcal{Q}$ should represent a semantically coherent group $C_i \subset R(q_0, \mathcal{I})$.
  • Figure 2: Architectures of the baseline methods proposed.
  • Figure 3: CroQS samples and predictions. Each panel reports a sample of a semantic cluster with its initial query $q_0$, its annotation, and the query suggested by the tested methods. Representative images (Eq. \ref{eq:representativeness}) have a colored outline.
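The setup in the Figure 1 caption can be read as a two-step interface: compute the result set $R(q_0, \mathcal{I})$, then emit one suggestion $\hat{q}_i$ per cluster $C_i$. The sketch below is only one way to express that interface. Top-k cosine retrieval, the externally supplied cluster labels, and the `describe` callback (a stand-in for any captioning-based or LLM-based method) are assumptions introduced for illustration, not details taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

import numpy as np


def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, k: int) -> np.ndarray:
    """Assumed R(q0, I): indices of the k images most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q
    return np.argsort(-scores)[:k]


def suggest_queries(
    q0: str,
    result_ids: Sequence[int],
    cluster_labels: Sequence[int],
    describe: Callable[[str, List[int]], str],
) -> Dict[int, str]:
    """Group retrieved images by cluster label and produce one suggestion per cluster."""
    clusters: Dict[int, List[int]] = {}
    for img_id, label in zip(result_ids, cluster_labels):
        clusters.setdefault(int(label), []).append(int(img_id))
    return {label: describe(q0, members) for label, members in clusters.items()}


# Toy data: four 2-D "image embeddings" and a query embedding (hypothetical values).
image_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
result_ids = retrieve(np.array([1.0, 0.0]), image_embs, k=3)

# Placeholder suggestion method; a real one would inspect the images in each cluster.
suggestions = suggest_queries(
    "dog",
    result_ids,
    cluster_labels=[0, 0, 1],
    describe=lambda q0, members: f"{q0} (group of {len(members)})",
)
```

Keeping the suggestion method behind a callback mirrors how a benchmark of this kind can compare different methods on the same fixed clusters.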