AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

Jiarui Zhang, Junqi Hu, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

Abstract

Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multimodal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, which causes scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture the diverse spatial topologies and scales of modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM with a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, a View-Conditioned Meta-Net (VCMN) injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, which reflects its robust capability for holistic agricultural spatial planning.

Paper Structure (44 sections, 13 equations, 7 figures, 11 tables)

This paper contains 44 sections, 13 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Overview of AgroNVILA: a comparison of our framework with existing paradigms, and its comprehensive performance radar chart on the AgroMind benchmark.
  • Figure 2: The curation pipeline of the AgroOmni training set comprises three stages: data collection and pre-processing, question generation, and quality control.
  • Figure 3: Comprehensive statistics of AgroOmni. (a) Classification of 14 fine-grained agricultural tasks across four cognitive dimensions. (b) Multi-view data scale and task distribution across UAV, Satellite, and Ground perspectives. (c-d) Word clouds for QA pairs, highlighting the high density of domain-specific agronomic terminology.
  • Figure 4: Architecture of AgroNVILA. Driven by a Perception-Reasoning Decoupling (PRD) paradigm, the framework sequentially integrates a View-Conditioned Meta-Net (VCMN) for multi-view spatial anchoring and an ARPO module for expert-aligned logical reasoning.
  • Figure 5: Evolution of spatial cognition on a Satellite scene. Red (Baselines): Scale collapse and micro-texture hallucinations. Blue (+VCMN): Perspective anchoring for regional awareness. Green (+ARPO): Topological alignment for holistic landscape synthesis. Additional cases are provided in the supplementary case study.
  • ...and 2 more figures