Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions

David Thulke, Jakob Kemmler, Christian Dugast, Hermann Ney

TL;DR

This paper addresses faithfulness in retrieval-augmented generation (RAG) for climate questions. It defines an automated evaluation framework that separates faithfulness (whether the output is supported by the retrieved passages) from factuality, and analyzes how ClimateGPT's instruction fine-tuning affects faithfulness. It introduces ClimateGPT Faithful+, trained with unfaithful training data excluded, which raises the share of supported atomic claims from 30% to 57% on the main benchmark, with further gains reported on climate-policy and hallucination-detection evaluations. The results suggest that post-training data selection and grounding strategies are pivotal for faithful RAG behavior, though retrieval quality and evaluation limitations remain. Overall, the work advances reliable grounding in climate-focused LLMs, with practical implications for policy-relevant information dissemination and public trust.

Abstract

Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.
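
The reported figures (30% and 57%) are shares of atomic claims that are supported by the retrieved context. As a rough illustration of how such a share is computed, here is a minimal sketch. It assumes the answer has already been split into atomic claims and that some judge decides whether each claim is supported. The names `supported_claim_share` and `toy_is_supported` are hypothetical, and the substring check is only a stand-in for a real verifier, not the paper's evaluation method.

```python
from typing import Callable, Sequence


def supported_claim_share(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the fraction of claims judged as supported by the context.

    `is_supported` is a placeholder for whatever judge is used
    (assumption: it takes a claim and the context and returns a bool).
    """
    if not claims:
        raise ValueError("at least one claim is required")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_is_supported(claim: str, context: str) -> bool:
    """Toy stand-in judge: case-insensitive substring match only."""
    return claim.lower() in context.lower()


if __name__ == "__main__":
    context = "Sea levels are rising. Emissions grew in 2022."
    claims = ["sea levels are rising", "emissions fell in 2022"]
    share = supported_claim_share(claims, context, toy_is_supported)
    print(f"supported claims: {share:.0%}")
```

In practice the judge would be a model-based verifier rather than string matching. The aggregation step, supported claims divided by total claims, stays the same.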

Paper Structure

This paper contains 20 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Shortened outputs of the original ClimateGPT and the Faithful+ variant on one example from the Climate Policy Radar task. Text marked in red is not faithful, i.e. it is not supported by the given context. The full example is shown in Figure 3.
  • Figure 2: Example comparing the outputs of ClimateGPT and ClimateGPT 7B Faithful+ on one example from the held-out test set. Parts marked in red correspond to claims that are not supported by the given context according to our automatic evaluation.
  • Figure 3: Example comparing the outputs of ClimateGPT and ClimateGPT 7B Faithful+ on one example from the Climate Policy Radar data. Parts marked in red correspond to claims that are not supported by the given context according to our automatic evaluation.