Climate Finance Bench

Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov

TL;DR

Climate Finance Bench addresses the challenge of grounding QA on corporate climate disclosures by providing an open, end-to-end benchmark built from 33 reports across 11 sectors and 330 expert-validated QA pairs spanning extraction, numerical reasoning, and logical inference. It systematically compares RAG configurations and multiple LLM back-ends, highlighting that retrieval quality is the main bottleneck and that hybrid dense+BM25 retrieval with cross-encoder reranking yields the strongest results among tested setups. The study also integrates automated evaluation via an LLM-as-a-Judge and reports environmental footprints, advocating for transparent carbon accounting in AI-for-climate applications and promoting lighter, quantized models to reduce emissions with minimal accuracy loss. Practically, the benchmark serves as a reproducible test-bed for researchers and practitioners to optimize factual accuracy, retrieval coverage, and sustainability in climate-finance QA workflows, while underscoring the continued need for human oversight in high-stakes contexts.
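
One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is an illustration only: this summary does not say which fusion rule the authors use, so RRF, the constant `k=60`, and the passage ids are all assumptions. The cross-encoder reranking step is not implemented here; in a full pipeline it would rescore the top fused candidates.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one ranking.

    Each passage receives 1 / (k + rank) from every list it appears in.
    RRF and k=60 are illustrative assumptions, not the paper's stated method.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical passage ids returned by a dense retriever and by BM25.
dense_ranking = ["p3", "p1", "p7"]
bm25_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Passages that both retrievers rank highly rise to the top, which is the intuition behind hybrid retrieval: lexical and semantic matches cover each other's misses.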

Abstract

Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we compare several RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting the advantages of techniques such as weight quantization.
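
The retrieval bottleneck can be made concrete with a simple hit-rate check: for each question, does at least one retrieved passage contain the reference answer? The sketch below is a minimal illustration under stated assumptions. The function name, the case-insensitive substring match, and the sample passages are hypothetical and are not the paper's evaluation protocol.

```python
def retrieval_hit_rate(retrieved_passages, gold_answers):
    """Share of questions with at least one retrieved passage that
    contains the gold answer string (case-insensitive substring match).

    Substring matching is a crude, assumed proxy; it misses paraphrases.
    """
    if not gold_answers:
        return 0.0
    hits = 0
    for passages, answer in zip(retrieved_passages, gold_answers):
        needle = answer.strip().lower()
        if any(needle in passage.lower() for passage in passages):
            hits += 1
    return hits / len(gold_answers)


# Made-up example: two questions, one retrieved passage each.
retrieved = [
    ["Scope 1 emissions were 120 ktCO2e in 2023."],
    ["The board oversees climate risk."],
]
gold = ["120 ktCO2e", "net zero by 2050"]

hit_rate = retrieval_hit_rate(retrieved, gold)
print(hit_rate)
```

When this rate is low, the generator never sees the evidence it needs, so no choice of LLM back-end can recover the answer.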

Paper Structure

This paper contains 60 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Accuracy breakdown (correct, incomplete, incorrect) for the Minimal RAG configuration across five LLMs.
  • Figure 2: Accuracy breakdown (correct, incomplete, incorrect) for the Hybrid RAG configuration across the seven LLMs tested.
  • Figure 3: Stepwise impact of successive retrieval upgrades on answer quality (Minimal RAG → + BM25 lexical → + reranking → + HTML conversion). Adding BM25 raises the correct-answer rate from 54.8% to 59.1%, and adding reranking on top of the hybrid dense-sparse retrieval lifts it further to 62.1%. Introducing Docling's HTML conversion without extra post-processing lowers the rate to 57.3%, indicating that raw structural noise can offset earlier gains. Bars show absolute counts (annotated) and the corresponding share of the 330-question test set.
  • Figure 4: Comparison of unquantized and 4-bit quantized LLaMA 3.1-8B under the Minimal RAG setting. Quantization leads to negligible accuracy loss while significantly reducing resource usage.
  • Figure 5: Breakdown of answer quality for each question category under the best-performing setup (Claude 3.5 + hybrid retrieval). Numerical reasoning edges out pure extraction, while logical reasoning lags behind because it demands multi-hop synthesis across passages.
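
The percentages quoted in the Figure 3 caption can be turned into approximate answer counts and stage-to-stage changes. The sketch below assumes each rate is a share of the full 330-question test set, rounded to one decimal. The counts it prints are implied by that assumption and are not read off the figure's annotations.

```python
TOTAL_QUESTIONS = 330

# Correct-answer rates (%) as quoted in the Figure 3 caption.
correct_rate_pct = {
    "Minimal RAG": 54.8,
    "+ BM25 lexical": 59.1,
    "+ reranking": 62.1,
    "+ HTML conversion": 57.3,
}


def implied_count(rate_pct, total=TOTAL_QUESTIONS):
    """Nearest whole number of questions matching a percentage share."""
    return round(rate_pct * total / 100)


implied_counts = {
    stage: implied_count(rate) for stage, rate in correct_rate_pct.items()
}

# Change in percentage points relative to the previous stage.
stages = list(correct_rate_pct)
deltas_pp = {
    stages[i]: round(correct_rate_pct[stages[i]] - correct_rate_pct[stages[i - 1]], 1)
    for i in range(1, len(stages))
}

print(implied_counts)
print(deltas_pp)
```

Under these assumptions, the drop after HTML conversion (4.8 points) is larger than either of the two preceding gains (4.3 and 3.0 points), though the rate stays above the Minimal RAG baseline.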