MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

Vardhan Dongre; Chi Gui; Shubham Garg; Hooshang Nayyeri; Gokhan Tur; Dilek Hakkani-Tür; Vikram S. Adve

TL;DR

MIRAGE addresses the need for realistic, domain-grounded evaluation of multimodal information-seeking and reasoning in agricultural expert conversations. It pairs two benchmarks, MMST (single-turn) and MMMT (multi-turn), with real-world data, image-grounded contexts, and an LLM-as-judge evaluation protocol to assess identification, causal reasoning, clarification strategies, and long-form guidance. Key findings are a persistent gap between open-source LVLMs and GPT-4.x, limited gains from adding metadata, and substantial difficulty generalizing to unseen entities. These results motivate targeted future work on contextual grounding, long-tail entity handling, and interactive dialogue. The benchmark supports the practical development of context-sensitive, knowledge-intensive VLMs for real-world agricultural decision support, and the paper outlines limitations and directions for extension to broader domains and interactive capabilities.

Abstract

We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse real-world benchmarks available for vision-language models. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios in open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and decide whether to proactively guide the interaction or respond directly. Project Page: https://mirage-benchmark.github.io
