Optimizing Agricultural Research: A RAG-Based Approach to Mycorrhizal Fungi Information
Mohammad Usman Altam, Md Imtiaz Habib, Tuan Hoang
TL;DR
The paper addresses the challenge of keeping large language models up to date with rapidly evolving agricultural knowledge, focusing on arbuscular mycorrhizal fungi (AMF). It presents a Retrieval-Augmented Generation (RAG) pipeline built for Mycophyto that combines semantic retrieval from agronomy and biotechnology literature with a structured data extraction module that captures experimental metadata. Embeddings are stored in a Pinecone vector database, and answers are generated by a Mistral AI LLM using a context-grounded prompt. The dual-layer design supports both text-grounded responses and cross-trial meta-analyses, with the aim of improving accuracy and producing actionable insights for AMF-based sustainable agriculture. The paper reports effective retrieval and grounded generation, along with a CLI-based interface for researchers and a reproducible local deployment workflow that supports scalable ingestion and evaluation.
Abstract
Retrieval-Augmented Generation (RAG) represents a transformative approach within natural language processing (NLP), combining neural information retrieval with generative language modeling to enhance both contextual accuracy and factual reliability of responses. Unlike conventional Large Language Models (LLMs), which are constrained by static training corpora, RAG-powered systems dynamically integrate domain-specific external knowledge sources, thereby overcoming temporal and disciplinary limitations. In this study, we present the design and evaluation of a RAG-enabled system tailored for Mycophyto, with a focus on advancing agricultural applications related to arbuscular mycorrhizal fungi (AMF). These fungi play a critical role in sustainable agriculture by enhancing nutrient acquisition, improving plant resilience under abiotic and biotic stresses, and contributing to soil health. Our system operationalizes a dual-layered strategy: (i) semantic retrieval and augmentation of domain-specific content from agronomy and biotechnology corpora using vector embeddings, and (ii) structured data extraction to capture predefined experimental metadata such as inoculation methods, spore densities, soil parameters, and yield outcomes. This hybrid approach ensures that generated responses are not only semantically aligned but also supported by structured experimental evidence. To support scalability, embeddings are stored in a high-performance vector database, allowing near real-time retrieval from an evolving literature base. Empirical evaluation demonstrates that the proposed pipeline retrieves and synthesizes highly relevant information regarding AMF interactions with crop systems, such as tomato (Solanum lycopersicum). The framework underscores the potential of AI-driven knowledge discovery to accelerate agroecological innovation and enhance decision-making in sustainable farming systems.
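The dual-layered strategy described above can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the passages are invented, a bag-of-words counter stands in for a learned embedding model, an in-memory list stands in for the vector database, and the regular expressions, field names, and prompt wording are all hypothetical. No call to Pinecone or Mistral AI is made.

```python
import math
import re
from collections import Counter

# Hypothetical passages for illustration; not drawn from the paper's corpus.
PASSAGES = [
    "Tomato plants inoculated with AMF showed improved phosphorus uptake.",
    "Spore density of 50 spores per gram was applied by seed coating at soil pH 6.5.",
    "Wheat rust resistance depends on cultivar genetics.",
]


def embed(text):
    """Bag-of-words term counts: a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, top_k=2):
    """Layer (i): rank passages by similarity to the query, keep the top_k."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:top_k]


def extract_metadata(text):
    """Layer (ii): pull predefined fields with regexes (assumed patterns)."""
    fields = {}
    density = re.search(r"(\d+(?:\.\d+)?)\s*spores per gram", text)
    if density:
        fields["spore_density_per_g"] = float(density.group(1))
    soil_ph = re.search(r"pH\s*(\d+(?:\.\d+)?)", text)
    if soil_ph:
        fields["soil_ph"] = float(soil_ph.group(1))
    return fields


def build_prompt(query, contexts):
    """Assemble a context-grounded prompt (assumed wording) for a generator."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    question = "How does AMF affect tomato phosphorus uptake?"
    hits = retrieve(question, PASSAGES, top_k=2)
    print(build_prompt(question, hits))
    print([extract_metadata(p) for p in PASSAGES])
```

In a deployment along the lines the abstract describes, `embed` would presumably be replaced by an embedding model, the list scan by a vector database query, and the prompt would be sent to an LLM. The extracted fields could then be tabulated across studies for cross-trial comparison.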
