Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases

Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, Feng Zheng

TL;DR

This work addresses the challenge of applying large multimodal models to agriculture by building Agri-LLaVA, a knowledge-infused agricultural vision-language model. It constructs a large-scale agricultural multimodal dataset and a two-stage training pipeline (feature alignment pre-training followed by end-to-end instruction-tuning) to inject domain knowledge. The authors also introduce two benchmarks, Agri-LLaVA-Chatbot-Bench and Agri-LLaVA-VQA-Bench, and report that Agri-LLaVA achieves substantial gains over general-domain baselines while attaining about 55.4% of GPT-4 performance on chatbot tasks. By open-sourcing data and models, the work aims to accelerate research in agricultural LMMs, though it acknowledges remaining challenges such as data scarcity and possible inaccuracies in complex real-world scenarios.
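The two-stage pipeline mentioned above can be pictured as a schedule that changes which parts of the model receive gradient updates. The sketch below is only an illustration under stated assumptions: the component names (`vision_encoder`, `projector`, `language_model`) and the choice of which components are frozen in each stage follow the general LLaVA-style recipe and are not confirmed by this summary. `StageConfig`, `build_schedule`, and `frozen_components` are hypothetical helpers introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical component names; the summary only names the two stages.
COMPONENTS = ("vision_encoder", "projector", "language_model")


@dataclass(frozen=True)
class StageConfig:
    name: str             # label for the training stage
    trainable: frozenset  # components whose weights are updated
    data: str             # kind of data the stage consumes


def build_schedule():
    """Return the two stages in training order.

    Assumption: stage 1 updates only the projector, and stage 2 updates
    the projector and the language model, as in LLaVA-style training.
    """
    return [
        StageConfig(
            name="feature_alignment_pretraining",
            trainable=frozenset({"projector"}),
            data="image and knowledge alignment pairs",
        ),
        StageConfig(
            name="instruction_tuning",
            trainable=frozenset({"projector", "language_model"}),
            data="multimodal instruction-following conversations",
        ),
    ]


def frozen_components(stage):
    """List the components that stay fixed during a stage, sorted by name."""
    return sorted(set(COMPONENTS) - stage.trainable)


for stage in build_schedule():
    print(stage.name, "freezes", frozen_components(stage))
```

Keeping the schedule as plain data makes the difference between the stages easy to inspect: under these assumptions, the first stage only learns to map visual features into the language model's input space, while the second stage also adapts the language model to agricultural instructions.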

Abstract

In the general domain, large multimodal models (LMMs) have achieved significant advancements, yet challenges persist in applying them to specific fields, especially agriculture. As the backbone of the global economy, agriculture confronts numerous challenges, with pests and diseases being particularly concerning due to their complexity, variability, rapid spread, and high resistance. This paper specifically addresses these issues. We construct the first multimodal instruction-following dataset in the agricultural domain, covering over 221 types of pests and diseases with approximately 400,000 data entries. This dataset aims to explore and address the unique challenges in pest and disease control. Based on this dataset, we propose a knowledge-infused training method to develop Agri-LLaVA, an agricultural multimodal conversation system. To accelerate progress in this field and encourage more researchers to engage with it, we design a diverse and challenging evaluation benchmark for agricultural pests and diseases. Experimental results demonstrate that Agri-LLaVA excels in agricultural multimodal conversation and visual understanding, providing new insights and approaches for addressing agricultural pests and diseases. By open-sourcing our dataset and model, we aim to promote research and development in LMMs within the agricultural domain and to contribute to tackling the challenges of agricultural pests and diseases. All resources can be found at https://github.com/Kki2Eve/Agri-LLaVA.

Paper Structure

This paper contains 28 sections, 5 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: The data statistics of our agricultural multimodal instruction-following data.
  • Figure 2: An example of our agricultural pests and diseases instruction-following data. At the top is the image along with its corresponding structured knowledge. At the bottom is the instruction-following data generated by GPT-4 based solely on the provided knowledge.
  • Figure 3: Agri-LLaVA network architecture.
  • Figure 4: An example of the prompt used to generate disease feature alignment data.
  • Figure 5: An example of the prompt used to generate pest feature alignment data.
  • ...and 12 more figures