CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering
Liangji Kong, Aditya Joshi, Sarvnaz Karimi
TL;DR
CAIRNS tackles the challenge of making climate adaptation knowledge both readable and scientifically grounded by coupling location-aware retrieval with structured prompting (ScholarGuide) and a kappa-weighted hybrid evaluator that aligns automated and expert judgments. It introduces a ReAct-style loop for iterative evidence gathering and verification, and demonstrates that the ScholarGuide approach yields stronger structure and verifiability while maintaining accuracy. Although data-augmented retrieval shows limited benefit at present, the framework achieves strong average performance and provides a scalable, expert-aligned evaluation method without fine-tuning. The work advances practical, region-aware climate QA with an emphasis on traceable evidence and reproducible evaluation.
Abstract
Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured data (for example, heterogeneous climate data available via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts, in this case farmer advisors, to obtain credible preliminary answers from complex evidence sources on the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answer pairs, we show that CAIRNS outperforms the baselines on most metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.
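To make the idea of a consistency-weighted hybrid evaluator concrete, the following is a minimal sketch, not the CAIRNS implementation. It assumes that each LLM judge is weighted by its Cohen's kappa agreement with expert labels on a small calibration set, and that the final score is the weighted mean of the judges' scores. The function names, the clipping of negative kappa to zero, and the example data are all hypothetical choices made for illustration.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_weights(judge_labels, expert_labels):
    """Assumed weighting scheme: normalise each judge's kappa, clipped at zero."""
    raw = {
        name: max(cohen_kappa(labels, expert_labels), 0.0)
        for name, labels in judge_labels.items()
    }
    total = sum(raw.values())
    if total == 0:
        # No judge agrees with the experts beyond chance: fall back to equal weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: k / total for name, k in raw.items()}


def hybrid_score(judge_scores, weights):
    """Weighted mean of per-judge scores for a single answer."""
    return sum(weights[name] * score for name, score in judge_scores.items())


# Hypothetical calibration data: binary expert labels and two LLM judges.
expert = [1, 0, 1, 1, 0, 0]
judges = {
    "judge_a": [1, 0, 1, 1, 0, 0],  # always agrees with the expert
    "judge_b": [1, 0, 1, 0, 0, 1],  # agrees on four of six items
}
weights = kappa_weights(judges, expert)
score = hybrid_score({"judge_a": 4.0, "judge_b": 2.0}, weights)
```

In this sketch a judge that agrees with the experts more often than chance pulls the final score towards its own rating, while a judge at or below chance level contributes nothing. The paper's actual weighting may differ.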
