MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve

TL;DR

MIRAGE addresses the need for realistic, domain-grounded evaluation of multimodal information-seeking and reasoning in agricultural expert conversations. It couples two benchmarks, MMST and MMMT, with real-world data, image-grounded contexts, and a robust LLM-judge evaluation protocol to assess identification, causal reasoning, clarification strategies, and long-form guidance. Key findings show a persistent gap between open-source LVLMs and GPT-4.x, limited gains from metadata, and substantial generalization challenges to unseen entities, motivating targeted future work in contextual grounding, long-tail entity handling, and interactive dialogue. The benchmark advances practical development of context-sensitive, knowledge-intensive VLMs for real-world agricultural decision support, while outlining limitations and directions for broader-domain extension and interactive capabilities.

Abstract

We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse real-world benchmarks available for vision-language models. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io

Paper Structure

This paper contains 94 sections, 17 equations, 32 figures, 26 tables.

Figures (32)

  • Figure 1: An overview of the MIRAGE benchmark, detailing its components. The benchmark includes: (1) The Multimodal Singleturn (MMST) Benchmark, with 8,184 interactions featuring 6,856 biological entities across seven agronomic categories. (2) The Multimodal Multiturn (MMMT) Benchmark, a corpus of 861 dialogues for evaluating 'clarify-or-respond' decision-making. Additionally, MIRAGE contains the MMST Contextual Benchmark, a specialized single-turn set of 3,934 interactions where expert responses are related to time and location metadata.
  • Figure 2: LoRA fine-tuning results on identification and management tasks. Left: ID Accuracy (%) on seen and unseen entities for Qwen2.5-VL-3B across checkpoints: Instruct (epoch 0) and LoRA epochs 2, 4, 6, and 8. The grey line with $\bullet$ markers traces the Reasoning Score (0–4 scale) on seen entities. Right: Fine-tuning performance on the management task across four metrics.
  • Figure 3: Mean performance of closed-source and leading open-source LVLMs on the Standard-MG (Left) and Contextual-MG (Right) subsets. Each bar is the average over three judges ($\mathrm{Score}_{m,k} = \frac{1}{3}\sum_{r=1}^3 \mathrm{Judge}_{r}(m,k)$) for Accuracy ($\mathrm{Acc}$), Relevance ($\mathrm{Rel}$), Completeness ($\mathrm{Com}$), and Parsimony ($\mathrm{Par}$) on a 0–4 scale.
  • Figure 4: Filtered AskExtension data—(left) number of images per user question, (center) number of URLs per expert answer, and (right) distribution of total URL content length.
  • Figure 5: An illustration of the data curation process for MIRAGE-MMST.
  • ...and 27 more figures
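
The Figure 3 caption defines each reported score as the mean of three judge ratings, $\mathrm{Score}_{m,k} = \frac{1}{3}\sum_{r=1}^3 \mathrm{Judge}_{r}(m,k)$, for a model $m$ and metric $k$. A minimal sketch of that averaging step is shown below. The function name, the dictionary layout, and the example ratings are illustrative assumptions, not the authors' evaluation code or data.

```python
# Metric abbreviations as used in the Figure 3 caption (0-4 scale).
METRICS = ("Acc", "Rel", "Com", "Par")


def average_judge_scores(judge_scores):
    """Average per-metric ratings over judges.

    judge_scores: list of dicts, one per judge, mapping each metric
    name in METRICS to that judge's rating (hypothetical layout).
    Returns a dict mapping each metric to its mean rating.
    """
    if not judge_scores:
        raise ValueError("at least one judge is required")
    n = len(judge_scores)
    return {k: sum(j[k] for j in judge_scores) / n for k in METRICS}


# Made-up ratings from three judges for one model response.
example = [
    {"Acc": 3, "Rel": 4, "Com": 2, "Par": 3},
    {"Acc": 2, "Rel": 4, "Com": 3, "Par": 3},
    {"Acc": 4, "Rel": 4, "Com": 1, "Par": 3},
]
scores = average_judge_scores(example)
```

The sketch divides by the number of judges supplied rather than hard-coding 3, so it reduces to the caption's formula when exactly three judges are used.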