A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian

TL;DR

The paper addresses the need for accurate, knowledge-rich crop disease diagnosis using multimodal AI. It introduces the Crop Disease Domain Multimodal (CDDM) dataset, comprising 137k images and 1M QA pairs spanning diagnosis and knowledge, and a LoRA-based finetuning strategy that updates the visual encoder, adapter, and language model to adapt LVLMs like Qwen-VL-Chat to agriculture. Experiments show that models finetuned on CDDM outperform baselines on both diagnosis accuracy and knowledge QA, highlighting the value of domain-specific instruction-following data. By releasing the dataset and code, the work provides a practical resource to accelerate development of farmers' decision-support tools and advances in agricultural multimodal AI. It bridges cutting-edge vision-language models with domain-specific agricultural needs.

Abstract

While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.
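
To make the finetuning idea in the abstract concrete, the following is a minimal sketch of low-rank adaptation in plain Python. It is not the authors' implementation: the class `LoRALinear`, the helper `matmul`, the layer sizes, the rank, and the `alpha` value are all hypothetical and chosen only for illustration. It shows the standard LoRA form, in which a frozen weight W is augmented by a trainable update (alpha / r) * B @ A, applied here to one stand-in layer for each of the three components named in the abstract.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A, where A is rank x in
    and B is out x rank. Only A and B would be updated during finetuning.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is
        # zero at initialization and the layer matches the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def identity(n):
    """Stand-in for a pretrained weight matrix (hypothetical)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical toy sizes: one adapted layer per component, all trained together.
modules = {
    "visual_encoder": LoRALinear(identity(8), rank=2, alpha=4.0, seed=1),
    "adapter": LoRALinear(identity(4), rank=2, alpha=4.0, seed=2),
    "language_model": LoRALinear(identity(16), rank=2, alpha=4.0, seed=3),
}

for name, layer in modules.items():
    full = len(layer.weight) * len(layer.weight[0])
    print(f"{name}: {layer.trainable_params()} trainable of {full} frozen weights")
```

Because B is initialized to zero, each adapted layer reproduces its frozen counterpart before any training step; the number of trainable values per layer grows with rank * (in + out) rather than in * out, which is what makes adapting all three components at once affordable.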

Paper Structure

This paper contains 14 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example comparison of LVLMs on crop disease diagnosis. Our model accurately identifies crop and disease categories, offering detailed prevention and treatment methods. In contrast, Qwen-VL-Chat fails to determine both crop and disease categories, and provides detailed prevention and treatment methods, as indicated by the red texts.
  • Figure 2: Examples of the crop disease image dataset. Each image represents a different category, and the leaves show a high degree of similarity, from their colors to their shapes. Additionally, some spot diseases display very similar visual features. Among the images, the two marked with red boxes represent different diseases but look very similar; the two marked with yellow boxes belong to different types of crops but have a very similar shape.
  • Figure 3: An instance of our CDDM data. The conversations cover the diagnosis, prevention, and treatment of crop diseases.
  • Figure 4: Distribution of the number of images per category in the crop disease dataset.
  • Figure 5: The prompt example of utilizing GPT-4 to generate instruction-following data of crop disease diagnosis. In the few-shot example within the "Prompt" part, the QA pairs highlighted in red are carefully crafted to include negative responses. After sequentially entering the "Prompt" part and the "Query" part, GPT-4 can generate 8 similar QA pairs, with negative responses highlighted in green.
  • ...and 3 more figures