SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting

Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt

TL;DR

SustainableQA tackles the urgent need for high-quality, domain-specific QA data to support retrieval-augmented systems for corporate sustainability and EU Taxonomy reporting. It introduces a scalable pipeline that fuses semantic passage classification, a hybrid span extraction workflow, and a table-to-paragraph transformation to generate over 195k QA pairs from 61 real-world reports, followed by an automated faithfulness-and-relevance refinement process. Empirical results show that a compact 8B parameter model, fine-tuned on SustainableQA, can outperform larger state-of-the-art models under various prompting strategies, and that the dataset provides strong utility for RAG benchmarks and domain-specific QA tasks. The work advances reproducible evaluation for regulation-aware QA in sustainability contexts and highlights avenues for multimodal extension and broader regulatory applicability.

Abstract

The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. Large Language Models and Retrieval-Augmented Generation (RAG) systems require high-quality, domain-specific question-answering datasets for this task. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, the generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. This results in a final, robust dataset of over 195,000 diverse factoid and non-factoid QA pairs, whose effectiveness is demonstrated by initial fine-tuning experiments where a compact 8B parameter model significantly outperforms much larger state-of-the-art models. SustainableQA proves to be a highly effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance data.

Paper Structure

This paper contains 53 sections, 5 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: SustainableQA dataset generation pipeline.
  • Figure 2: Span distribution by category.
  • Figure 3: Three-stage quality assessment and refinement pipeline with variables: F (faithfulness score), R (relevance score), Q (question), A (answer), K (maximum refinement attempts).
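The assess-then-repair-or-discard loop named in the Figure 3 caption can be illustrated with a minimal sketch. It reuses the caption's variables (F, R, Q, A, K), but everything else is an assumption made for illustration: the `QAPair` type, the `score` and `refine` callables, and the 0.8 acceptance threshold are hypothetical and are not taken from the paper, which may score and refine pairs quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class QAPair:
    question: str  # Q in the Figure 3 caption
    answer: str    # A in the Figure 3 caption


# Hypothetical interfaces (not from the paper):
# a scorer returns (F, R) for a pair given its source passage,
# a refiner returns a revised pair.
Scorer = Callable[[QAPair, str], Tuple[float, float]]
Refiner = Callable[[QAPair, str], QAPair]


def assess_and_refine(
    pair: QAPair,
    context: str,
    score: Scorer,
    refine: Refiner,
    max_attempts: int,       # K: maximum refinement attempts
    threshold: float = 0.8,  # assumed cut-off applied to both F and R
) -> Optional[QAPair]:
    """Return an accepted pair, or None if it is discarded after K repairs."""
    for attempt in range(max_attempts + 1):
        f, r = score(pair, context)  # F (faithfulness), R (relevance)
        if f >= threshold and r >= threshold:
            return pair              # accept as-is or after repair
        if attempt < max_attempts:
            pair = refine(pair, context)  # try to repair the pair
    return None                      # still low quality: discard
```

In practice the scorer and refiner would likely be model calls. Passing them in as plain callables keeps the control flow (accept, repair up to K times, otherwise discard) separate from however F and R are actually computed.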