AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

Muhammad Awais; Ali Husain Salem Abdulla Alharthi; Amandeep Kumar; Hisham Cholakkal; Rao Muhammad Anwer

TL;DR

The work tackles the lack of agricultural vision-language data by constructing AgroInstruct from vision-only datasets to bootstrap expert-tuning for AgroGPT, an efficient vision-language model. The approach adds a third expert-tuning stage atop standard visual instruction tuning, yielding two models (AgroGPT-3B and AgroGPT-7B) trained on 70k expert conversations. AgroEvals provides domain-specific evaluation across six VQA tasks, showing significant gains in fine-grained agricultural concepts and multi-turn conversations, with competitive performance against open/closed models and clear generalization to unseen datasets. This pipeline enables practical, expert-level agricultural reasoning in multimodal interactions while remaining computationally efficient and open for extension.

Abstract

Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, but they rely on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. We then use this dataset to expert-tune AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT's performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at https://github.com/awaisrauf/agroGPT.
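The data-construction idea in the abstract (vision-only labels plus curated class information, handed to an LLM) can be illustrated with a minimal sketch. This is not the authors' pipeline: `ImageRecord`, `build_prompt`, the prompt wording, and the sample class entry are all hypothetical names and values chosen for illustration, and the actual LLM call is left out.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """One sample from a vision-only dataset (hypothetical structure)."""
    image_path: str
    label: str


def build_prompt(record: ImageRecord, class_info: dict[str, str]) -> str:
    """Compose a text-only prompt asking an LLM to write a multi-turn
    conversation about an image, using only its label and curated facts."""
    facts = class_info.get(record.label, "No curated information available.")
    return (
        "You are an agriculture expert looking at an image.\n"
        f"The image shows: {record.label}.\n"
        f"Background facts: {facts}\n"
        "Write a multi-turn conversation between a user asking about the "
        "image and an assistant answering as if it can see the image."
    )


# Illustrative curated class information (assumed, not from AgroInstruct).
class_info = {
    "tomato late blight": (
        "Caused by the oomycete Phytophthora infestans; "
        "produces dark, water-soaked leaf lesions."
    )
}

record = ImageRecord("images/0001.jpg", "tomato late blight")
prompt = build_prompt(record, class_info)
print(prompt)
```

The resulting prompt would be sent to an LLM of choice, and the generated conversation paired with the original image to form one instruction-tuning sample.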
