Responsible Retrieval Augmented Generation for Climate Decision Making from Documents

Matyas Juhasz, Kalyan Dutia, Henry Franks, Conor Delahunty, Patrick Fawbert Mills, Harrison Pim

TL;DR

This work introduces a novel evaluation framework with domain-specific dimensions tailored to climate-related documents. It applies the framework to evaluate Retrieval-Augmented Generation (RAG) approaches, assessing retrieval and generation quality in a prototype tool that answers questions about individual climate law and policy documents.

Abstract

Climate decision making is constrained by the complexity and inaccessibility of key information within lengthy, technical, and multilingual documents. Generative AI technologies offer a promising route for improving the accessibility of information contained within these documents, but suffer from limitations. These include (1) a tendency to hallucinate or misrepresent information, (2) difficulty in steering or guaranteeing properties of generated output, and (3) reduced performance in specific technical domains. To address these challenges, we introduce a novel evaluation framework with domain-specific dimensions tailored for climate-related documents. We then apply this framework to evaluate Retrieval-Augmented Generation (RAG) approaches and assess retrieval and generation quality within a prototype tool that answers questions about individual climate law and policy documents. In addition, we publish a human-annotated dataset and scalable automated evaluation tools, with the aim of facilitating broader adoption and robust assessment of these systems in the climate domain. Our findings highlight the key components of responsible deployment of RAG to enhance decision-making, while also providing insights into user experience (UX) considerations for safely deploying such systems to build trust with users in high-risk domains.

Paper Structure

This paper contains 35 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Components in our RAG pipeline
  • Figure 2: Confusion Matrix for Retrieval LLM Judge
  • Figure 3: Confusion matrix of the G-Eval evaluation of the CPR-generation-policy dimension against human-labelled ground-truth data
  • Figure 4: Correlation matrix for the faithfulness evaluators
  • Figure 5: Average faithfulness violation scores (lower is better) by evaluator model on the x-axis and response source-model on the y-axis
  • ...and 4 more figures