PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Rohan Mahadev; Joyce Yuan; Patrick Poirson; David Xue; Hao-Yu Wu; Dmitry Kislyuk

TL;DR

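PinPoint is a CIR benchmark that goes beyond single ground-truth answers by adding multiple correct answers, explicit hard negatives, paraphrased instructions, and multi-image queries. The paper also proposes a training-free reranking method, based on an off-the-shelf MLLM, that can be applied to any existing retrieval system.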

Abstract

Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false-positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 major paradigms, we uncover three significant drawbacks. First, the best methods, while achieving an mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. Second, the best models exhibit 25.1% performance variation across paraphrases, indicating significant room for improvement in current CIR techniques. Third, multi-image queries perform 40 to 70% worse across different methods. To address the issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
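To make the headline metric concrete, the following is a minimal sketch of mAP@10 for queries with multiple correct answers. The exact normalization used by PinPoint is not stated in this summary, so the sketch assumes one common convention (dividing by min(number of relevant items, k)); the function names and the toy data are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k for one query with possibly many correct answers.

    Assumption: normalize by min(len(relevant), k); other conventions exist.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10
) -> float:
    """Mean of AP@k over (ranked_list, relevant_set) pairs, one pair per query."""
    scores = [average_precision_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two queries, each with its own ranked list and set of correct answers.
toy_runs = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["y", "c"], {"c"}),            # AP = (1/2) / 1
]
toy_map = mean_average_precision_at_k(toy_runs, k=10)
```

With several correct answers per query, a system is rewarded for ranking many of them early rather than for surfacing a single target.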

Paper Structure (35 sections, 4 equations, 8 figures, 4 tables)

Introduction
Related Work
CIR Datasets and Benchmarks
Major Families of CIR Methods
Post Retrieval Ranking
PinPoint Dataset
Dataset Construction
Generating Diverse Modification Instructions
Paraphrase Generation for Robustness Evaluation
Multi-Answer Annotation with Explicit Negatives
LLM Bias in Dataset Construction
Distributions
Evaluation
Metrics
$\Delta$mAP@10
...and 20 more sections

Figures (8)

Figure 1: Example single image query from PinPoint demonstrating multiple instruction paraphrases, multiple ground truths (green), and explicit hard negatives (red)
Figure 2: Multi-image composition query (13.4% of PinPoint) requiring cross-image attribute extraction
Figure 3: Metric pitfall: Recall@10 = 1.0, yet 8/10 results violate the colour/material constraint (Precision@10 = 0.20, Neg@10 = 0.60).
Figure 4: Dataset Construction Flow
Figure 5: PinPoint distributions. Left-to-right: (a) query domain categorization; (b) instruction type mix; (c) Skin Tone buckets for people-containing queries.
...and 3 more figures
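
The pitfall described in the Figure 3 caption can be reproduced with a toy ranking. The sketch below assumes that Neg@k is the fraction of the top-k results that are annotated hard negatives and that Recall@k is the fraction of ground-truth answers found in the top k. These definitions, the function names, and the item identifiers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], positives: Set[str], k: int = 10) -> float:
    """Fraction of ground-truth answers that appear in the top k."""
    if not positives:
        return 0.0
    return sum(1 for item in ranked[:k] if item in positives) / len(positives)


def precision_at_k(ranked: Sequence[str], positives: Set[str], k: int = 10) -> float:
    """Fraction of the top k results that are ground-truth answers."""
    return sum(1 for item in ranked[:k] if item in positives) / k


def negative_rate_at_k(ranked: Sequence[str], negatives: Set[str], k: int = 10) -> float:
    """Fraction of the top k results that are annotated hard negatives (assumed Neg@k)."""
    return sum(1 for item in ranked[:k] if item in negatives) / k


# Hypothetical top-10 list: 2 correct answers, 6 explicit negatives, 2 unlabeled items.
ranking = ["neg1", "pos1", "neg2", "neg3", "unl1", "neg4", "pos2", "neg5", "unl2", "neg6"]
positives = {"pos1", "pos2"}
negatives = {"neg1", "neg2", "neg3", "neg4", "neg5", "neg6"}

recall = recall_at_k(ranking, positives)            # 1.0: both answers are retrieved
precision = precision_at_k(ranking, positives)      # 0.2: only 2 of 10 results are correct
neg_rate = negative_rate_at_k(ranking, negatives)   # 0.6: 6 of 10 are hard negatives
```

In this toy case, a recall-only evaluation would report a perfect score even though most of the displayed results are wrong, which is the gap that explicit negatives are meant to expose.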
