INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn

TL;DR

INQUIRE is a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. Reranking with more powerful multimodal models improves retrieval performance, but a significant margin for improvement remains.

Abstract

We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io.
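
To make the mAP@50 metric and the reranking task concrete, here is a minimal Python sketch. It is not the benchmark's official evaluation code. It assumes one common convention for AP@k, in which the sum of precisions at each relevant hit is divided by min(k, number of relevant images); the paper's exact normalization may differ. The names `average_precision_at_k`, `mean_ap_at_k`, and `rerank` are hypothetical, and the ranked lists are assumed to contain no duplicate image IDs.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 50
) -> float:
    """AP@k for one query.

    Assumed convention: divide by min(k, number of relevant items), so a
    perfect top-k ranking scores 1.0 even when more than k items are relevant.
    """
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / min(k, len(relevant))


def mean_ap_at_k(
    rankings: Dict[str, Sequence[Hashable]],
    relevance: Dict[str, Set[Hashable]],
    k: int = 50,
) -> float:
    """Mean of AP@k over all queries that have a ranking."""
    scores = [average_precision_at_k(rankings[q], relevance[q], k) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0


def rerank(
    candidates: Sequence[Hashable], score: Callable[[Hashable], float]
) -> List[Hashable]:
    """Reorder an existing candidate list (e.g. a top-100) by a new score, highest first."""
    return sorted(candidates, key=score, reverse=True)


# Toy example with made-up image IDs and scores (illustration only).
first_stage = ["x", "y", "z"]
relevant_ids = {"y"}
stronger_scores = {"x": 0.1, "y": 0.9, "z": 0.5}

ap_before = average_precision_at_k(first_stage, relevant_ids)
reranked = rerank(first_stage, lambda img: stronger_scores[img])
ap_after = average_precision_at_k(reranked, relevant_ids)
```

In this toy case the only relevant image moves from rank 2 to rank 1, so AP rises from 0.5 to 1.0. The rerank step can only reorder the candidates it is given, so its ceiling is set by what the first-stage retrieval returned.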

Paper Structure

This paper contains 40 sections, 5 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: INQUIRE is a text-to-image retrieval benchmark of 250 expert-level queries comprehensively labeled over a new five million image dataset. The queries span a range of ecological and biodiversity concepts, requiring reasoning, image understanding, and domain expertise.
  • Figure 2: Category breakdown for the fine-grained queries that make up INQUIRE. Each query category falls under one of the following supercategories: Species, Context, Behavior, or Appearance.
  • Figure 3: Proportion of queries in INQUIRE associated with each iconic group of species.
  • Figure 4: The INQUIRE benchmark consists of a full-dataset ranking task and a reranking task targeting different aspects of the image retrieval problem.
  • Figure 5: Left: CLIP zero-shot retrieval performance across supercategories using an identical backbone (ViT-B/16) trained or fine-tuned on different datasets. The training dataset has a significant effect on final performance; e.g., BioCLIP is tuned on natural world data at the expense of forgetting other categories. Right: CLIP retrieval performance of models trained on DFN (Fang et al., 2023).
  • ...and 8 more figures