LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach

TL;DR

Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision.

Abstract

Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application to domain-specific agricultural tasks, such as plant pathology, remains limited by the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, paired with metadata, from which 13,950 question-answer pairs are generated across six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on LeafBench, we find substantial disparities in their disease understanding capabilities. Performance varies markedly across tasks: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
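
The abstract reports accuracy separately for each task. As a minimal sketch of how such per-task scores could be tallied, the Python snippet below groups question-answer records by task and computes the fraction answered correctly. The record layout (keys "task", "answer", "prediction"), the function name per_task_accuracy, and the toy records are assumptions made for illustration; they are not taken from the LeafBench code or data.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for a list of QA records.

    Each record is a dict with the hypothetical keys "task", "answer",
    and "prediction"; this layout is an assumption, not the LeafBench schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Compare case-insensitively, ignoring surrounding whitespace.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records, made up for illustration only (not benchmark data).
records = [
    {"task": "HDC", "answer": "diseased", "prediction": "Diseased"},
    {"task": "HDC", "answer": "healthy", "prediction": "healthy"},
    {"task": "PC", "answer": "fungus", "prediction": "bacteria"},
    {"task": "PC", "answer": "virus", "prediction": "virus"},
]

scores = per_task_accuracy(records)
for task, acc in sorted(scores.items()):
    print(f"{task}: {acc:.2f}")
```

Reporting one score per task rather than a single pooled accuracy keeps an easy task, such as binary healthy-diseased classification, from masking weaker results on fine-grained tasks.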

Paper Structure (23 sections, 14 figures, 8 tables)

This paper contains 23 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: The LeafNet Curation and Benchmarking Pipeline. (A), Curation process overview. Metadata is synthesized from authoritative sources (NIH, NIFA) to map raw images to biological taxonomies, including species, disease, pathogenic agent, and symptom descriptions. (B), Expert verification and benchmarking. All metadata-image pairs undergo agricultural expert review to filter noisy samples. From this verified data, we construct LeafBench, a curated subset designed to benchmark on image classification, few-shot learning, and .
  • Figure 2: The LeafNet curation and benchmarking pipeline. (A), Curation process overview. Metadata is synthesized from authoritative sources (NIH, NIFA) to map raw images to biological taxonomies, including Species, Disease, Pathogenic Agent, and Symptom descriptions. (B), Expert verification and benchmarking. All metadata-image pairs undergo review by agricultural experts to filter out noisy samples. From this verified data, we construct LeafBench, a curated subset designed to benchmark on image classification, few-shot learning, and .
  • Figure 3: The distribution of digital image acquisition by country.
  • Figure 4: Illustration of the six Q&A task types in LeafBench: Disease Identification (DI), Pathogen Classification (PC), Crop Species Identification (CSI), Symptom Identification (SI), Healthy–Diseased Classification (HDC), and Scientific Name Classification (SNC).
  • Figure 5: Distribution of question-answer pairs across diagnostic tasks in LeafBench. The bar chart shows the distribution for each of the six tasks: Healthy-Diseased Classification (HDC), Disease Classification (DC), Crop Species Identification (CSI), Scientific Name Classification (SNC), Pathogen Classification (PC), and Symptom Identification (SI). The numbers above the bars indicate the raw sample count for each category.
  • ...and 9 more figures