AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Jing Wu, Zurong Mai, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Lingyuan Zhao, Haohuan Fu, Huang Jianxi, Juepeng Zheng

TL;DR

AgriCoT establishes the first agriculture-focused VQA benchmark that explicitly evaluates Chain-of-Thought reasoning. By aggregating 4,535 QA pairs from multiple datasets and enforcing a formal CoT generation and refinement process, it enables simultaneous assessment of final answers and multi-step reasoning. The study reveals that proprietary VLMs achieve higher final accuracy but lag in reasoning depth, underscoring the need for CoT-centric evaluation to guide domain-specific model development. The benchmark, with its five problem dimensions and extensive cross-model analysis, provides a foundation for advancing reliable, interpretable AI in precision agriculture.

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning and is specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.

Paper Structure

This paper contains 51 sections, 27 figures, 11 tables.

Figures (27)

  • Figure 1: Comparison of AgriCoT with previous agricultural multimodal benchmarks. AgriCoT possesses four key advantages: multi-step reasoning, multimodal alignment, long-form reasoning and reasoning evaluation.
  • Figure 2: Comparison of VLMs across multiple dimensions.
  • Figure 3: The number of samples across different dimensions in AgriCoT.
  • Figure 4: Hierarchical task system of AgriCoT. Based on the progressive cognitive pipeline in agricultural intelligence, AgriCoT constructs five evaluation dimensions (object detection, quantitative analysis, disease monitoring, spatial understanding, and environmental management), covering 15 task types.
  • Figure 5: The construction of the AgriCoT benchmark comprises four main steps: collecting samples from data sources, ensuring the quality of the samples, generating a CoT for each QA pair, and conducting a comprehensive evaluation of representative VLMs.
  • ...and 22 more figures