Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

Sanyam Singh; Naga Ganesh; Vineet Singh; Lakshmi Pedapudi; Ritesh Kumar; SSP Jyothi; Archana Karanam; C. Yashoda; Mettu Vijaya Rekha Reddy; Shesha Phani Debbesa; Chandan Dash

TL;DR

A hybrid LLM architecture that decouples factual retrieval from conversational delivery is presented: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses.

Abstract

Large Language Models show promise for agricultural advisory, yet vanilla models exhibit unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high-stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1 while maintaining high relevance. A fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models. The stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.
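To make the atomic fact metrics named in the abstract concrete, the following is a minimal sketch of how recall, precision, F1, and a contradiction count could be computed once each claim in a response has been checked against the golden facts. It is not the DG-EVAL implementation: the `Claim` and `FactScores` types, the verdict labels, and the `score_response` function are hypothetical names chosen for illustration, and the verdicts are assumed to come from an upstream judge (human or model) that is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Claim:
    """One atomic claim extracted from a model response (hypothetical type)."""
    text: str
    verdict: str  # assumed labels: "supported", "unsupported", "contradicted"
    golden_id: Optional[str] = None  # golden fact this claim matches, if any


@dataclass(frozen=True)
class FactScores:
    recall: float
    precision: float
    f1: float
    contradictions: int


def score_response(golden_ids: Set[str], claims: Iterable[Claim]) -> FactScores:
    """Score one response against a set of golden fact identifiers."""
    claims = list(claims)
    supported = [c for c in claims if c.verdict == "supported"]

    # Recall: share of golden facts that at least one supported claim covers.
    covered = {c.golden_id for c in supported if c.golden_id in golden_ids}
    recall = len(covered) / len(golden_ids) if golden_ids else 0.0

    # Precision: share of response claims that the ground truth supports.
    precision = len(supported) / len(claims) if claims else 0.0

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    # Contradictions: claims that conflict with the ground truth.
    contradictions = sum(1 for c in claims if c.verdict == "contradicted")
    return FactScores(recall, precision, f1, contradictions)


# Usage with made-up placeholder claims (not data from the paper).
golden = {"g1", "g2", "g3", "g4"}
claims = [
    Claim("claim matching golden fact g1", "supported", "g1"),
    Claim("claim matching golden fact g2", "supported", "g2"),
    Claim("claim with no backing in the ground truth", "unsupported"),
    Claim("claim that conflicts with the ground truth", "contradicted"),
]
scores = score_response(golden, claims)
print(scores)
```

Reporting contradictions as a separate count, rather than folding them into precision, keeps a harmful wrong claim distinguishable from a merely unsupported one; whether DG-EVAL aggregates them this way is an assumption of this sketch.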

Paper Structure (67 sections, 5 figures, 9 tables)

This paper contains 67 sections, 5 figures, 9 tables.

Introduction
Contributions
Background and Related Work
Motivation
Related Work
Agricultural AI Systems.
LLM Fine-Tuning for Specialized Domains.
Evaluation Frameworks for Factual Accuracy.
Hybrid Architectures and Design Motivation.
Data Curation Pipeline
Human Expert Data Curation Pipeline
Annotation Platform and Evaluation Modes
Query Selection and Reviewer Methodology
Quality Assurance and Safety
Synthetic Data Curation Pipeline
...and 52 more sections

Figures (5)

Figure 1: The evaluate.farmer.chat platform used for expert curation and evaluation. The interface supports quality rating, additional/missing information annotation, like/dislike feedback, and side-by-side comparison of model responses, enabling systematic human evaluation of agricultural advisory content. Specific user inputs shown in the image may not be accurate and are for illustration purposes only.
Figure 2: Expert data curation pipeline. Farmer queries are prioritized by frequency and model confidence, then reviewed by domain experts using absolute scoring across five dimensions. Quality assurance includes partial double-review and control pairs for consistency tracking. Comparative evaluation produces preference data for future alignment work.
Figure 3: Synthetic data curation pipeline. Four complementary sources feed into multi-source response generation. All synthetic responses undergo the same Golden Fact extraction pipeline as human-curated data, followed by quality scoring to filter low-confidence or incomplete facts.
Figure 4: Hybrid Engine Architecture. Top: Training pipeline showing data curation and LoRA fine-tuning. Bottom: Inference pipeline where the fine-tuned model retrieves Golden Facts, which are transformed into conversational responses by the stitching layer.
Figure 5: Specificity illustration. Answer 1 triggers only 3 of 7 contextual anchors (sparse highlighting); Answer 2 triggers all 7 (dense highlighting), immediately conveying the difference between generic and specific agricultural advice.
