Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Siting Li, Xiang Gao, Simon Shaolei Du

TL;DR

Attribute-focused image retrieval aims to match images to queries that emphasize specific visual attributes rather than global semantics. The authors introduce COCO-Facet to benchmark this setting and demonstrate that both CLIP-like and current MLLM-based retrievers struggle with fine-grained attributes. They propose promptable image embeddings that condition on GPT-generated prompts to highlight target attributes, showing improved recall across diverse attribute types and image pools. To enable practical deployment, they offer two acceleration strategies: pre-processing promptable embeddings when prompts are predefined, which yields a 15% improvement in Recall@5, and a test-time linear approximation when prompts are only available during inference, which yields an 8% improvement. Overall, the work advances fine-grained, attribute-grounded T2I retrieval with tangible efficiency strategies and provides a benchmark and codebase for future research.

Abstract

While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent, stronger Multimodal Large Language Model (MLLM)-based retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
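
The Recall@5 figures quoted above can be read with a small sketch of how Recall@K is commonly computed. This is an illustration only: it assumes a query counts as a hit when at least one relevant image appears among its top-K scored images, which may differ from the exact protocol used for COCO-Facet. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(scores, relevant, k):
    """Fraction of queries with at least one relevant image in the top-k.

    scores[q][i] is the similarity between query q and image i (assumed given).
    relevant[q] is the set of image indices considered correct for query q.
    """
    hits = 0
    for query_scores, relevant_ids in zip(scores, relevant):
        # Rank image indices by descending similarity and keep the first k.
        ranked = sorted(range(len(query_scores)),
                        key=lambda i: query_scores[i], reverse=True)
        if relevant_ids.intersection(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy example: two queries scored against a pool of three images.
toy_scores = [[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3]]
toy_relevant = [{2}, {0}]

r1 = recall_at_k(toy_scores, toy_relevant, k=1)
r2 = recall_at_k(toy_scores, toy_relevant, k=2)
r3 = recall_at_k(toy_scores, toy_relevant, k=3)
```

In this toy pool, neither query ranks its relevant image first, only the first query finds it within the top two, and both find it within the top three.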

Paper Structure

This paper contains 39 sections, 3 equations, 6 figures, 19 tables.

Figures (6)

  • Figure 1: Overview. (Above) We study the task of attribute-focused text-to-image retrieval and build COCO-Facet for benchmarking various retrievers. (Below) We show that using promptable image embeddings enhances performance on such queries, and propose two acceleration strategies to improve its applicability.
  • Figure 2: Average retrieval performance across various base retrievers on COCO-Facet, with and without GPT-generated prompts. The same set of prompts brings consistent performance gain on different multimodal base retrievers.
  • Figure 3: Recall@1 and Recall@5 (in percentage points) of accelerated text-to-image retrieval on COCO-Facet using approximated promptable image embeddings with varying sample size $K$. The results are averaged over five independent runs. "Baseline" refers to using VLM2Vec-Phi-3.5-V without prompts.
  • Figure 4: Visualization of which image regions the models attend to for image-text matching. The query text is "Find me an image that contains any car."
  • Figure 5: Failed top-1 retrieval results of the text-based retrieval. The query is "Find me an image that contains any bird." in all cases.
  • ...and 1 more figures
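
The approximated promptable embeddings and sample size $K$ in Figure 3 can be illustrated with a rough sketch. The text here only states that a linear approximation is used with a varying sample size; the least-squares formulation below, which maps general image embeddings to promptable ones, is an assumption, as are the function names `fit_linear_map` and `apply_linear_map` and the synthetic data standing in for real embeddings.

```python
import numpy as np


def fit_linear_map(general_sample, promptable_sample):
    """Least-squares fit of W such that general_sample @ W ~ promptable_sample."""
    W, *_ = np.linalg.lstsq(general_sample, promptable_sample, rcond=None)
    return W


def apply_linear_map(general_all, W):
    """Approximate promptable embeddings for the whole pool, then L2-normalize."""
    approx = general_all @ W
    norms = np.linalg.norm(approx, axis=1, keepdims=True)
    return approx / np.clip(norms, 1e-12, None)


# Synthetic stand-ins: in practice these would come from a multimodal retriever.
rng = np.random.default_rng(0)
n_images, dim, k = 200, 16, 32
general = rng.normal(size=(n_images, dim))   # prompt-free image embeddings
true_map = rng.normal(size=(dim, dim))
promptable = general @ true_map              # toy "expensive" prompted embeddings

# Compute prompted embeddings for only K sampled images, then fit the map.
sample_idx = rng.choice(n_images, size=k, replace=False)
W = fit_linear_map(general[sample_idx], promptable[sample_idx])

# Approximate the prompted embeddings of every image in the pool.
approx = apply_linear_map(general, W)
target = promptable / np.linalg.norm(promptable, axis=1, keepdims=True)
```

In this toy setup the relation between the two embedding spaces is exactly linear, so the fit recovers it from the K samples. With real embeddings the fit would only be approximate, which is presumably why recall depends on K.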