An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant
Mohit Tomar, Abhisek Tiwari, Tulika Saha, Prince Jha, Sriparna Saha
TL;DR
This work tackles the lack of plant-care dialogue data and multimodal support by introducing the Plantational dataset and the EcoSage dialogue assistant. Plantational comprises approximately one thousand plant-care conversations with accompanying images and annotated intents and dialogue acts, enabling multimodal evaluation. The authors benchmark multiple LLMs and VLMs under zero-shot, few-shot, and fine-tuning settings and propose EcoSage, which uses BLIP-2 visual encoding and LoRA-based adapters within Vicuna for multimodal response generation. Results indicate that incorporating images improves context-specific responses, but multimodal alignment remains challenging, which underscores the need for semantic evaluation metrics like BERT-F1. The work lays a foundation for practical, multimodal plant-care assistants.
Abstract
In recent times, awareness of imminent environmental challenges has grown, and people are showing a stronger dedication to caring for the environment and nurturing green life. The current $19.6 billion indoor gardening industry reflects this sentiment: it signifies not only monetary value but also a profound human desire to reconnect with the natural world. However, several recent surveys cast a revealing light on the fate of plants in our care, with more than half succumbing, primarily due to improper care. Thus, the need for accessible expertise that can guide individuals through the intricacies of plant care is greater than ever. In this work, we make the very first attempt at building a plant care assistant, which aims to assist people with plant(-ing) concerns through conversations. We propose a plant care conversational dataset named Plantational, which contains around 1K dialogues between users and plant care experts. Our proposed end-to-end approach is two-fold: (i) we first benchmark the dataset with various large language models (LLMs) and a visual language model (VLM), studying the impact of instruction tuning (zero-shot and few-shot prompting) and fine-tuning techniques on this task; (ii) we then build EcoSage, a multi-modal plant care dialogue generation framework that incorporates adapter-based modality infusion using a gated mechanism. We conduct an extensive examination, using both automated and manual evaluation, of the performance of the various LLMs and the VLM in generating domain-specific dialogue responses, to underscore the respective strengths and weaknesses of these models.
