Leveraging Vision Language Models for Specialized Agricultural Tasks

Muhammad Arbab Arshad, Talukder Zaki Jubery, Tirtho Roy, Rim Nassiri, Asheesh K. Singh, Arti Singh, Chinmay Hegde, Baskar Ganapathysubramanian, Aditya Balu, Adarsh Krishnamurthy, Soumik Sarkar

TL;DR

This work introduces AgEval, a comprehensive benchmark for evaluating Vision Language Models on plant stress phenotyping tasks, addressing data-scarce agricultural settings with zero-shot and few-shot in-context learning. It defines a task taxonomy (Identification, Classification, Quantification), curates a 12-dataset benchmark, and assesses six state-of-the-art VLMs using $F1$, $NMAE$, and $MRR$ metrics, along with analyses of bullseye example relevance and intra-task uniformity via the coefficient of variation. Key findings show that large models like GPT-4o exhibit strong few-shot gains (e.g., $F1$ rising from $46.24\%$ to $73.37\%$ in 8-shot identification) and that carefully chosen exemplars substantially boost performance, while variability across classes and datasets points to domain-specific challenges. The study positions VLMs as viable, adaptable alternatives to traditional specialized models in plant stress phenotyping, provides prompts and a robust evaluation framework, and outlines directions for broader agricultural tasks, data efficiency, and deployment considerations in real-world settings.

Abstract

As Vision Language Models (VLMs) become increasingly accessible to farmers and agricultural experts, there is a growing need to evaluate their potential in specialized tasks. We present AgEval, a comprehensive benchmark for assessing VLMs' capabilities in plant stress phenotyping, offering a solution to the challenge of limited annotated data in agriculture. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples, providing insights into their behavior and adaptability. AgEval encompasses 12 diverse plant stress phenotyping tasks, evaluating zero-shot and few-shot in-context learning performance of state-of-the-art models including Claude, GPT, Gemini, and LLaVA. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification. To quantify performance disparities across classes, we introduce metrics such as the coefficient of variation (CV), revealing that VLMs' training impacts classes differently, with CV ranging from 26.02% to 58.03%. We also find that strategic example selection enhances model reliability, with exact category examples improving F1 scores by 15.38% on average. AgEval establishes a framework for assessing VLMs in agricultural applications, offering valuable benchmarks for future evaluations. Our findings suggest that VLMs, with minimal few-shot examples, show promise as a viable alternative to traditional specialized models in plant stress phenotyping, while also highlighting areas for further refinement. Results and benchmark details are available at: https://github.com/arbab-ml/AgEval
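The coefficient of variation mentioned above can be illustrated with a minimal sketch. This excerpt does not give the paper's exact formula, so the code assumes the standard definition (standard deviation divided by the mean, expressed as a percentage) applied to per-class F1 scores. The function name `coefficient_of_variation`, the choice of population standard deviation as the default, and the sample scores are all hypothetical.

```python
import statistics


def coefficient_of_variation(scores, sample=False):
    """Return the coefficient of variation as a percentage: 100 * std / mean.

    Assumption: the standard CV definition; the paper's exact variant
    (population vs. sample standard deviation) is not stated in this excerpt.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # pstdev = population standard deviation, stdev = sample standard deviation
    std = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return 100.0 * std / mean


# Hypothetical per-class F1 scores (in percent) for one model on one task.
per_class_f1 = [80.0, 60.0, 40.0, 20.0]
cv = coefficient_of_variation(per_class_f1)
print(f"CV = {cv:.2f}%")
```

A lower value means the model performs more uniformly across classes, while a higher value signals that some classes are handled much better than others.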

Paper Structure

This paper contains 21 sections, 17 equations, 32 figures, and 5 tables.

Figures (32)

  • Figure 1: Overview of the AgEval benchmark. The figure showcases sample images across different types of tasks and specific problems, representing diverse plant stress phenotyping challenges in agriculture.
  • Figure 2: Zero-shot comparative performance of VLMs.
  • Figure 3: 8-shot comparative performance of VLMs.
  • Figure 4: Performance comparison of models across 0-shot, 2-shot, and 8-shot settings on various datasets. F1 scores are shown directly, while NMAE is inverted (100 - NMAE) for consistent visualization, with higher values indicating better performance.
  • Figure 5: Performance comparison on individual tasks of the AgEval benchmark across different shot settings (0 to 8 shots) for the top-4 performing VLMs.
  • ...and 27 more figures

Theorems & Definitions (1)

  • Definition 1: Bullseye Shot