ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge

Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang, Xi Peng

TL;DR

ARK is a dual-axis benchmark that diagnoses multimodal retrieval along knowledge-intensive and reasoning-intensive dimensions, spanning five knowledge domains with 17 subtypes, six reasoning-skill categories, and 16 visual data types. It pairs most queries with targeted hard negatives to deter shortcut matching and evaluates 23 retrievers, revealing persistent bottlenecks in fine-grained visual and spatial reasoning even as model scale improves performance. Inference-time interventions such as query rewriting and re-ranking yield consistent gains, suggesting practical ways to strengthen reasoning in retrieval without relying solely on larger models, although substantial headroom remains. The benchmark thus serves as a diagnostic tool for building more reasoning-aware, knowledge-grounded multimodal retrieval systems, and its findings point to a need for stronger vision-centric reasoning.

Abstract

Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
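
To make the dual-axis idea concrete, the following is a minimal sketch of how per-instance retrieval results could be aggregated along both axes. It is not the authors' evaluation code: the `Instance` fields, the label strings (`domain_a`, `spatial`, and so on), and the use of hit@k as the metric are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One retrieval instance tagged along both axes (all field names are hypothetical)."""
    query_id: str
    knowledge_domain: str        # knowledge axis (placeholder label)
    reasoning_skills: frozenset  # reasoning axis; an instance may carry several skills
    positive_id: str             # the correct candidate
    ranked_ids: tuple            # retriever output, best first, hard negatives included


def hit_at_k(instance, k=1):
    """Return 1.0 if the correct candidate is among the top-k results, else 0.0."""
    return 1.0 if instance.positive_id in instance.ranked_ids[:k] else 0.0


def score_by_axis(instances, k=1):
    """Average hit@k per knowledge domain and per reasoning skill."""
    by_domain, by_skill = defaultdict(list), defaultdict(list)
    for inst in instances:
        hit = hit_at_k(inst, k)
        by_domain[inst.knowledge_domain].append(hit)
        for skill in inst.reasoning_skills:
            by_skill[skill].append(hit)

    def mean(values):
        return sum(values) / len(values)

    return ({d: mean(v) for d, v in by_domain.items()},
            {s: mean(v) for s, v in by_skill.items()})


# Toy data with made-up labels: in q1 a hard negative ("n1") outranks the positive.
toy = [
    Instance("q1", "domain_a", frozenset({"spatial"}), "c1", ("n1", "c1", "n2")),
    Instance("q2", "domain_a", frozenset({"spatial", "fine_grained"}), "c2", ("c2", "n3")),
    Instance("q3", "domain_b", frozenset(), "c3", ("c3", "n4")),
]
domain_scores, skill_scores = score_by_axis(toy, k=1)
print(domain_scores)
print(skill_scores)
```

Because an instance can carry several reasoning skills, the sketch counts its result once under each skill, so the per-skill averages are not a partition of the data in the way the per-domain averages are.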

Paper Structure (26 sections, 20 figures, 21 tables)

Figures (20)

  • Figure 1: Evolution of multimodal retrieval. (a) Traditional benchmarks emphasize object-level or surface-level semantic similarity matching on everyday imagery. (b) Recent benchmarks shift toward knowledge-intensive retrieval, where relevance hinges on domain knowledge beyond surface matching. (c) ARK explicitly separates knowledge and reasoning as two evaluation axes, enabling diagnosis of retrieval that requires both domain knowledge and structured reasoning over multimodal evidence.
  • Figure 2: ARK is a dual-axis benchmark for multimodal retrieval. (a)–(c) summarize three key characteristics: (a) Comprehensive knowledge domains: it covers five broad knowledge domains and 17 fine-grained subtypes, (b) Heterogeneous visual types: queries and candidates span diverse visual formats beyond everyday imagery, and (c) Reasoning-oriented evaluation: each instance in ARK is first labeled by its overall cognitive demand as perception-only, knowledge-intensive, reasoning-intensive, or both knowledge- and reasoning-intensive. Reasoning-intensive instances are further annotated with six reasoning-skill categories, enabling fine-grained analysis of model behavior. Note that a single instance may involve multiple reasoning skills. (d) Examples: representative instances of each reasoning skill. For reference, the primary reasoning skills and visual data types for each knowledge subtype are summarized in Table \ref{tab:ark_data_source}.
  • Figure 3: Illustration of the data curation process for ARK.
  • Figure 4: Visualized example of Art subtype.
  • Figure 5: Visualized example of Biology subtype.
  • ...and 15 more figures