AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang

TL;DR

AgMMU presents a real-world, domain-specific benchmark for vision-language models in agriculture by leveraging 116k farmer–expert dialogues to create 746 MCQs and 746 OEQs, supported by AgBase with 57,079 multimodal facts. The authors design a four-stage curation pipeline (categorization, knowledge extraction, QA generation, human verification) to produce a balanced, high-quality evaluation set and a large development corpus. A broad suite of models is evaluated in zero-shot and fine-tuned regimes, revealing substantial gaps in knowledge grounding and perception, with open-source models often lagging behind closed-source counterparts. Fine-tuning on AgBase yields notable gains (up to 11.6% on OEQs), underscoring the value of domain-specific data for improving VLM performance and motivating future knowledge integration and retrieval strategies in agricultural AI.

Abstract

We present AgMMU, a challenging real-world benchmark for evaluating and advancing vision-language models (VLMs) in the knowledge-intensive domain of agriculture. Unlike prior datasets that rely on crowdsourced prompts, AgMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three-stage pipeline (automated knowledge extraction, QA generation, and human verification), we construct (i) AgMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AgBase, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-source models trail proprietary ones by a wide margin. Simple fine-tuning on AgBase boosts open-source model performance on challenging OEQs by up to 11.6% on average, narrowing this gap and motivating future research on better strategies for knowledge extraction and distillation from AgBase. We hope AgMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agricultural AI development.

Paper Structure

This paper contains 25 sections, 18 figures, 2 tables.

Figures (18)

  • Figure 1: AgMMU is a multimodal agricultural dataset that challenges vision-language models (VLMs) to observe the details of images and provide factually precise answers. Derived from real-world conversations between users and authorized experts by USDA-funded Cooperative Extension, AgMMU covers five major agricultural knowledge types (demonstrated in five columns of the figure). AgMMU features 746 multiple-choice questions (MCQs) like conventional vision-language benchmarks such as MMMU (Yue et al., 2024) and the same number of open-ended questions (OEQs) like SimpleQA (Wei et al., 2024), all validated by human annotators. We also curate an agricultural knowledge base with 57,079 pieces of information for foundation model fine-tuning, extracted from experts' answers. AgMMU can benefit both knowledge-intensive VLMs and the social good of agriculture.
  • Figure 2: Starting from raw user-expert conversations, we design a four-step data curation pipeline with carefully designed prompts and human verification. (1) We employ LLaMA-70B to categorize each conversation and filter out samples that do not fall under our selected agriculture sub-domains. (2) A larger LLaMA-405B model extracts key agricultural knowledge from the long-form user-expert conversation. (3) These facts go to either the evaluation set or the development set. For evaluation questions, we use GPT-4o to format the original QA and agricultural knowledge into multiple-choice questions (MCQs) and open-ended questions (OEQs). (4) Finally, human annotators verify the quality of the questions and keep only the qualified ones in the evaluation set.
  • Figure 3: Statistics of AgMMU. (1) The agricultural sub-domain distribution of our raw dataset, after the categorization step, as explained in Figure 2. (2) AgMMU, after the knowledge extraction and evaluation curation steps, serves as a balanced subset of the raw dataset with proportional representation across knowledge types.
  • Figure 4: (a) The most common errors made by VLMs are knowledge errors. CoT represents samples that are originally wrong due to false reasoning, but corrected with chain-of-thought prompting. IPI, DII, SVD, SI, and MI are short for the five question types explained in the data curation section. (b) We show two common evaluation errors: the lack of knowledge to answer the question (top) and the wrong perception of the image (bottom).
  • Figure 5: The effectiveness of AgBase fine-tuning on OEQ examples. After simple fine-tuning, the LLaVA model can accurately identify issues that GPT and the original model fail to recognize in zero-shot scenarios.
  • ...and 13 more figures
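
The four-step curation pipeline described in the Figure 2 caption can be sketched as a simple control flow. This is only an illustrative outline, not the authors' implementation: the names `Conversation`, `Question`, `curate`, and every callable passed to it are hypothetical, the model calls (LLaMA-70B, LLaMA-405B, GPT-4o) and the human check are represented by injected functions, and the rule that routes a fact to the evaluation or development set is an assumption, since the caption does not specify it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Conversation:
    """A raw user-expert dialogue (hypothetical container)."""
    text: str
    image_paths: List[str] = field(default_factory=list)


@dataclass
class Question:
    """An evaluation item; kind is "mcq" or "oeq" (hypothetical container)."""
    kind: str
    prompt: str
    answer: str


def curate(
    conversations: Iterable[Conversation],
    categorize: Callable[[Conversation], Optional[str]],
    extract_facts: Callable[[Conversation], List[str]],
    to_eval: Callable[[str], bool],
    make_questions: Callable[[Conversation, str], List[Question]],
    verify: Callable[[Question], bool],
) -> Tuple[List[Question], List[str]]:
    """Run the four steps and return (evaluation questions, development facts)."""
    eval_set: List[Question] = []
    dev_set: List[str] = []
    for conv in conversations:
        # Step 1: categorize; None means outside the selected sub-domains.
        if categorize(conv) is None:
            continue
        # Step 2: extract key agricultural facts from the conversation.
        for fact in extract_facts(conv):
            # Step 3: route each fact (the routing rule is an assumption here).
            if not to_eval(fact):
                dev_set.append(fact)
                continue
            # Step 3 (cont.): format evaluation facts as MCQs and OEQs.
            for question in make_questions(conv, fact):
                # Step 4: keep only questions that pass verification.
                if verify(question):
                    eval_set.append(question)
    return eval_set, dev_set
```

Passing the stages in as callables keeps the sketch runnable without any model access; in practice each callable would wrap a prompt to the corresponding model or a human annotation step.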