AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Bo Yang; Yunkui Chen; Lanfei Feng; Yu Zhang; Xiao Xu; Jianyu Zhang; Nueraili Aierken; Runhe Huang; Hongjian Lin; Yibin Ying; Shijian Li

TL;DR

The paper tackles the lack of domain-tailored multimodal tools for agriculture by proposing AgriGPT-VL, a unified framework that couples a large-scale vision–language dataset (Agri-3M-VL) generated via a multi-agent Data Generator, a curriculum-driven training pipeline, and a rigorous evaluation suite (AgriBench-VL-4K). Through progressive textual grounding, shallow and deep multimodal alignment, and GRPO refinement, AgriGPT-VL achieves leading results on VL benchmarks while preserving text-only capabilities, and demonstrates strong generalization via external evaluation. The work emphasizes reproducibility and applicability in low-resource farming contexts by open-sourcing resources and providing a scalable blueprint for domain-specific multimodal systems. Overall, the suite advances agricultural AI by delivering integrated data, models, and benchmarks tailored to agronomic reasoning and evidence-based decision support.

Abstract

Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.
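The abstract reports pairwise win rates from an LLM-as-a-judge evaluation but does not spell out the protocol here. As a rough illustration only, the sketch below shows one common way to turn per-question judge verdicts into a win rate. The function name `pairwise_win_rate`, the verdict labels `"A"`, `"B"`, and `"tie"`, and the choice to count a tie as half a win are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def pairwise_win_rate(verdicts, count_ties_as_half=True):
    """Win rate of model A over model B from per-question judge verdicts.

    verdicts: iterable of "A", "B", or "tie" (hypothetical label scheme).
    count_ties_as_half: if True, each tie contributes 0.5 to A's wins
        (an assumed convention; the paper may handle ties differently).
    """
    counts = Counter(verdicts)

    # Reject labels outside the assumed scheme rather than silently ignoring them.
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")

    total = counts["A"] + counts["B"] + counts["tie"]
    if total == 0:
        raise ValueError("no verdicts provided")

    wins = counts["A"]
    if count_ties_as_half:
        wins += 0.5 * counts["tie"]
    return wins / total


if __name__ == "__main__":
    # Toy verdicts for four questions, not real evaluation data.
    sample = ["A", "A", "B", "tie"]
    print(pairwise_win_rate(sample))
    print(pairwise_win_rate(sample, count_ties_as_half=False))
```

Under this convention a win rate above 0.5 means the judge preferred model A more often than model B. Judges can be sensitive to answer order, so such verdicts are often collected with the two answers swapped as well; whether the paper does so is not stated in this section.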

Paper Structure

Table of Contents

Figures (6)