ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Sherine George, Nithish Saji

TL;DR

ESGBench addresses the challenge of explainable QA over ESG disclosures by providing a reproducible pipeline that ingests ESG/TCFD PDFs, builds a chunked and table-aware index, and generates QA pairs with verbatim evidence. It offers an evaluation suite with EM, F1, Numeric Accuracy, Recall@K, and per-category scores, plus a simple RAG baseline to highlight current limitations in numeric KPI grounding and table grounding. The dataset comprises 119 QA pairs from 10 companies, with 40–50% table-derived content, illustrating the need for robust numeric and table reasoning. By promoting evidence-grounded answers and recall-aware evaluation, ESGBench aims to accelerate transparent, standards-aligned ESG AI research and practical deployment, including multilingual and governance considerations.

Abstract

We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.

Paper Structure

This paper contains 20 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: ESGBench pipeline