Climate Finance Bench
Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov
TL;DR
Climate Finance Bench addresses the challenge of grounding question answering (QA) in corporate climate disclosures by providing an open, end-to-end benchmark built from 33 reports across 11 sectors and 330 expert-validated QA pairs spanning extraction, numerical reasoning, and logical inference. It systematically compares RAG configurations and multiple LLM back-ends, finding that retrieval quality is the main bottleneck and that hybrid dense+BM25 retrieval with cross-encoder reranking yields the strongest results among the tested setups. The study also automates evaluation with an LLM-as-a-Judge and reports environmental footprints, advocating transparent carbon accounting in AI-for-climate applications and lighter, quantized models that reduce emissions with minimal accuracy loss. In practice, the benchmark serves as a reproducible test-bed for researchers and practitioners to optimize factual accuracy, retrieval coverage, and sustainability in climate-finance QA workflows, while underscoring the continued need for human oversight in high-stakes contexts.
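To make the hybrid retrieval idea concrete, here is a minimal sketch of how a dense ranking and a BM25 ranking could be merged into one candidate list. It uses reciprocal rank fusion, which is an assumption chosen for illustration; the summary above does not say which fusion rule the benchmark uses. The function name, the passage ids, and the constant k=60 are all hypothetical. A cross-encoder reranker would then rescore the top fused candidates, a step not shown here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into a single ranking.

    Each passage earns 1 / (k + rank) from every list it appears in.
    Assumption: this fusion rule is illustrative, not the benchmark's own.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by passage id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical top-3 results from a dense retriever and from BM25.
dense_ranking = ["p3", "p1", "p7"]
bm25_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Passages found by both retrievers rise to the top of the fused list, which is the intuition behind combining lexical and semantic signals before reranking.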
Abstract
Climate Finance Bench is an open benchmark for question answering over corporate climate disclosures with large language models (LLMs). We curate 33 recent English-language sustainability reports from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs spanning pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we compare retrieval-augmented generation (RAG) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications and highlight the advantages of techniques such as weight quantization.
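One simple way to quantify whether a retriever locates passages that contain the answer is a hit rate at k: the fraction of questions for which at least one of the top-k retrieved passages is a known answer-bearing passage. The sketch below is an assumption made for illustration; the abstract does not state which retrieval metric the benchmark reports, and the function name and the question and passage ids are hypothetical.

```python
def hit_rate_at_k(retrieved, gold, k):
    """Fraction of questions with an answer-bearing passage in the top k.

    retrieved: dict mapping question id -> ranked list of passage ids
    gold:      dict mapping question id -> set of answer-bearing passage ids
    """
    if not retrieved:
        return 0.0
    hits = 0
    for question_id, passages in retrieved.items():
        answer_passages = gold.get(question_id, set())
        # A question counts as a hit if any top-k passage holds the answer.
        if any(pid in answer_passages for pid in passages[:k]):
            hits += 1
    return hits / len(retrieved)


# Hypothetical retrieval results for two questions.
retrieved = {
    "q1": ["p1", "p2", "p3"],
    "q2": ["p4", "p5", "p6"],
}
gold = {
    "q1": {"p2"},
    "q2": {"p9"},
}

print(hit_rate_at_k(retrieved, gold, k=2))
```

When this number is low, no choice of generator can recover the missing evidence, which is one way to read the claim that retrieval is the bottleneck.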
