AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

Jiarui Zhang; Junqi Hu; Zurong Mai; Yuhang Chen; Shuohong Lou; Henglian Huang; Lingyuan Zhao; Jianxi Huang; Yutong Lu; Haohuan Fu; Juepeng Zheng

Abstract

Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, which causes scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture the diverse spatial topologies and scales of modern precision agriculture. Building on this dataset, we propose AgroNVILA, an MLLM with a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, a View-Conditioned Meta-Net (VCMN) injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) uses reinforcement learning to align the model's decision-making with expert agricultural logic and to discourage statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, improving multi-altitude agricultural reasoning by +15.18% and reflecting a robust capability for holistic agricultural spatial planning.
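The abstract describes the VCMN only at a high level, so the following is a minimal sketch of one way a view label could be turned into context that is injected into visual tokens. It is not the authors' implementation. The additive-bias design, the randomly initialised weights, and the names `make_meta_net`, `view_bias`, and `condition_tokens` are all assumptions made for illustration; only the three view labels come from the text.

```python
import random

# View labels mentioned in the abstract and figure captions.
VIEWS = ("ground", "uav", "satellite")


def make_meta_net(dim, seed=0):
    """Build a hypothetical meta-net: one embedding per view plus a linear map.

    Weights are random here; in a trained model they would be learned.
    """
    rng = random.Random(seed)
    embed = {v: [rng.uniform(-1, 1) for _ in range(dim)] for v in VIEWS}
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return embed, weight


def view_bias(meta_net, view):
    """Map a view label to a single context vector (assumed design)."""
    embed, weight = meta_net
    e = embed[view]
    return [sum(w * x for w, x in zip(row, e)) for row in weight]


def condition_tokens(tokens, meta_net, view):
    """Add the same view-dependent vector to every visual token.

    The bias is computed once per image, so the extra cost does not grow
    with the per-token work of the vision encoder.
    """
    bias = view_bias(meta_net, view)
    return [[t + b for t, b in zip(token, bias)] for token in tokens]


# Usage: three dummy 4-dimensional visual tokens from a satellite image.
net = make_meta_net(dim=4)
tokens = [[0.0] * 4 for _ in range(3)]
satellite_tokens = condition_tokens(tokens, net, "satellite")
uav_tokens = condition_tokens(tokens, net, "uav")
```

In this sketch the same patch features yield different conditioned tokens for different views, which is the kind of signal a downstream language model could use to disambiguate scale.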

Paper Structure (44 sections, 13 equations, 7 figures, 11 tables)

Introduction
Related Work
Multi-modal Large Language Models in Agriculture
Reinforcement Learning for Alignment
AgroOmni
Data Construction
Dataset Statistics and Analysis
AgroNVILA
Overall Architecture
View-Conditioned Meta-Net: Vision-Side Prior Injection
Agriculture-aware Relative Policy Optimization (ARPO)
Experiment
Experimental Setup
Main Results
Ablation Studies
...and 29 more sections

Figures (7)

Figure 1: Overview of AgroNVILA. Comparison of our framework with existing paradigms, together with a radar chart of its performance on the AgroMind benchmark.
Figure 2: The curation pipeline of the AgroOmni training set comprises three stages: data collection and pre-processing, question generation, and quality control.
Figure 3: Comprehensive statistics of AgroOmni. (a) Classification of 14 fine-grained agricultural tasks across four cognitive dimensions. (b) Multi-view data scale and task distribution across UAV, Satellite, and Ground perspectives. (c-d) Word clouds for QA pairs, highlighting the high density of domain-specific agronomic terminology.
Figure 4: Architecture of AgroNVILA. Driven by a Perception-Reasoning Decoupling (PRD) paradigm, the framework sequentially integrates a View-Conditioned Meta-Net (VCMN) for multi-view spatial anchoring and an ARPO module for expert-aligned logical reasoning.
Figure 5: Evolution of spatial cognition on a Satellite scene. Red (Baselines): Scale collapse and micro-texture hallucinations. Blue (+VCMN): Perspective anchoring for regional awareness. Green (+ARPO): Topological alignment for holistic landscape synthesis. Additional cases are provided in the supplementary case study.
...and 2 more figures
