AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Aruna Gauba; Irene Pi; Yunze Man; Ziqi Pang; Vikram S. Adve; Yu-Xiong Wang

TL;DR

AgMMU is a real-world, domain-specific benchmark for vision-language models (VLMs) in agriculture. It draws on 116k farmer–expert dialogues to create 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), supported by AgBase, a development corpus of 57,079 multimodal facts. The authors design a four-stage curation pipeline (categorization, knowledge extraction, QA generation, human verification) to produce a balanced, high-quality evaluation set and a large development corpus. A broad suite of models is evaluated in zero-shot and fine-tuned regimes, revealing substantial gaps in knowledge grounding and perception, with open-source models often lagging behind closed-source counterparts. Fine-tuning on AgBase yields notable gains (up to 11.6% on OEQs), underscoring the value of domain-specific data for VLM performance and motivating future work on knowledge integration and retrieval strategies in agricultural AI.

Abstract

We present AgMMU, a challenging real-world benchmark for evaluating and advancing vision-language models (VLMs) in the knowledge-intensive domain of agriculture. Unlike prior datasets that rely on crowdsourced prompts, AgMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three-stage pipeline (automated knowledge extraction, QA generation, and human verification), we construct (i) AgMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AgBase, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-source models trail proprietary ones by a wide margin. Simple fine-tuning on AgBase boosts open-source model performance on challenging OEQs by up to 11.6% on average, narrowing this gap and motivating future research on better strategies for knowledge extraction and distillation from AgBase. We hope AgMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agricultural AI.
