AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models
Yutong Zhou, Masahiro Ryo
TL;DR
AgriBench addresses the lack of agriculture-specific evaluation for multimodal large language models by introducing a hierarchical, multi-task benchmark and MM-LUCAS, a multimodal EU land-use/land-cover dataset. The framework spans five task levels from basic recognition to human-aligned suggestions, leveraging RGB images, segmentation masks, depth maps, and LandQA prompts to comprehensively assess MM-LLMs. Five MM-LLMs (two open-source and three closed-source) are evaluated, revealing that while general agricultural knowledge is well captured, domain-specific inference such as disease diagnosis still benefits from specialized fine-tuning and richer multimodal cues. The work lays a foundation for robust, domain-aware MM-LLMs in agriculture, with MM-LUCAS enabling richer annotations and future expert-knowledge models for farming decision-support and policy planning.
Abstract
We introduce AgriBench, the first agriculture benchmark designed to evaluate MultiModal Large Language Models (MM-LLMs) for agricultural applications. To address the scarcity of agriculture knowledge-based datasets, we propose MM-LUCAS, a multimodal agriculture dataset that includes 1,784 landscape images, segmentation masks, depth maps, and detailed annotations (geographical location, country, date, land cover and land use taxonomic details, quality scores, aesthetic scores, etc.). MM-LUCAS is based on the Land Use/Cover Area Frame Survey (LUCAS) dataset, which contains comparable statistics on land use and land cover for the European Union (EU) territory. This work, which is still in progress, presents a groundbreaking perspective on advancing agriculture MM-LLMs and offers valuable insights for future developments and innovations in expert knowledge-based MM-LLMs.
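To make the composition of an MM-LUCAS sample concrete, the following is a minimal sketch of how one record might be represented in Python. It is not the authors' released schema: the class name `MMLucasRecord`, every field name, the helper `modalities`, and the placeholder values in `example_record` are assumptions chosen to mirror the modalities and annotation types listed in the abstract.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class MMLucasRecord:
    """Hypothetical container for one MM-LUCAS sample (field names are assumed)."""

    image_path: str          # RGB landscape image
    mask_path: str           # segmentation mask
    depth_path: str          # depth map
    latitude: float          # geographical location
    longitude: float
    country: str
    survey_date: date
    land_cover: str          # land cover taxonomy label
    land_use: str            # land use taxonomy label
    quality_score: Optional[float] = None
    aesthetic_score: Optional[float] = None


def modalities(record: MMLucasRecord) -> Dict[str, str]:
    """Return the three image-like modalities of a record, keyed by name."""
    return {
        "rgb": record.image_path,
        "mask": record.mask_path,
        "depth": record.depth_path,
    }


def example_record() -> MMLucasRecord:
    """Build a record from placeholder values; none of these come from the dataset."""
    return MMLucasRecord(
        image_path="images/sample.jpg",
        mask_path="masks/sample.png",
        depth_path="depth/sample.png",
        latitude=0.0,
        longitude=0.0,
        country="XX",
        survey_date=date(2000, 1, 1),
        land_cover="placeholder land cover",
        land_use="placeholder land use",
    )
```

Keeping the optional scores as `None` by default lets a loader represent samples whose quality or aesthetic annotations are missing without inventing values.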
