ESG Accountability Made Easy: DocQA at Your Service

Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar

TL;DR

ESG data are increasingly disclosed in PDFs, but extracting actionable information from them is difficult because the format is not machine-readable. The authors propose Deep Search DocQA, an end-to-end retrieval-augmented question-answering system that converts documents, encodes their content into a vector store, and uses LLMs to generate answers grounded in context from the original ESG reports. Key contributions include an architecture that handles PDF and scan conversion, structured data extraction, top-$k$ passage retrieval with SentenceTransformers, and grounding and safety checks for generated responses, enabling scalable QA over more than 10,000 ESG reports from more than 2,000 corporations. The work makes ESG disclosures more accessible to researchers, policymakers, and practitioners, and points to future extensions to multi-document queries and other document domains.

Abstract

We present Deep Search DocQA, an application that enables information extraction from documents via a conversational question-answering assistant. The system integrates technologies from several AI disciplines: document conversion to a machine-readable format (via computer vision), retrieval of relevant data (via natural language processing), and formulation of an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2,000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.
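
The retrieve-then-generate flow summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a SentenceTransformers encoder, the in-memory list stands in for a vector store, the sample passages are invented for illustration, and the names `build_store`, `retrieve`, and `build_prompt` are hypothetical. The LLM call itself is omitted; the sketch stops at the grounded prompt that would be sent to a model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_store(passages):
    """Hypothetical in-memory 'vector store': (passage, embedding) pairs."""
    return [(p, embed(p)) for p in passages]

def retrieve(store, question, k=2):
    """Return the k passages most similar to the question (top-k retrieval)."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Assemble a prompt that asks the model to answer only from the contexts."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the passages below. "
        "If the answer is not in the passages, say so.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

# Invented example passages, standing in for text converted from ESG reports.
PASSAGES = [
    "Scope 1 emissions were 120 kilotonnes of CO2 equivalent in 2022.",
    "The board of directors has nine members, four of whom are independent.",
    "Employee training averaged 30 hours per person.",
]

if __name__ == "__main__":
    store = build_store(PASSAGES)
    question = "What were the scope 1 emissions in 2022?"
    print(build_prompt(question, retrieve(store, question, k=2)))
```

In a realistic setting, `embed` would be replaced by a dense sentence encoder and the list by a proper vector index; the ranking and prompt-assembly logic would stay structurally the same.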

Paper Structure (4 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: System architecture: Simplified sketch of document question-answering pipeline.