ESGReveal: An LLM-based approach for extracting structured data from ESG reports

Yi Zou, Mengying Shi, Zhongjie Chen, Zhu Deng, ZongXiong Lei, Zihan Zeng, Shiming Yang, HongXiang Tong, Lei Xiao, Wenwen Zhou

TL;DR

ESGReveal is a structured, LLM-assisted pipeline that uses Retrieval Augmented Generation (RAG) to extract numerical and textual ESG indicators from corporate reports. The system combines an ESG metadata module, a report preprocessing stage, and an LLM agent to support targeted querying, retrieval, and extraction of data aligned with HKEx ESG standards. On 2022 ESG reports from 166 HKEx-listed companies, the GPT-4-based implementation achieves a data extraction accuracy of $Acc_{DE}=76.9\%$ and a disclosure analysis accuracy of $Acc_{DC}=83.7\%$, an improvement over baseline models. Disclosure averages $69.5\%$ for environmental data and $57.2\%$ for social data, which points to remaining transparency gaps. Ablation studies show that enhanced preprocessing and domain knowledge integration substantially improve performance, suggesting potential for broader deployment and for extension to pictorial data and other sustainability frameworks.

Abstract

ESGReveal is a method for efficiently extracting and analyzing Environmental, Social, and Governance (ESG) data from corporate reports, addressing the need for reliable ESG information retrieval. The approach uses Large Language Models (LLMs) enhanced with Retrieval Augmented Generation (RAG) techniques. The ESGReveal system includes an ESG metadata module for targeted queries, a preprocessing module for assembling databases, and an LLM agent for data extraction. It was evaluated on 2022 ESG reports from 166 companies listed on the Hong Kong Stock Exchange, selected to represent a broad range of industries and market capitalizations. With GPT-4, ESGReveal reached an accuracy of 76.9% in data extraction and 83.7% in disclosure analysis, an improvement over baseline models that highlights the framework's capacity to make ESG data analysis more precise. The analysis also showed room for stronger ESG disclosure: environmental and social data disclosures stood at 69.5% and 57.2%, respectively, suggesting a need for more corporate transparency. The current version of ESGReveal does not process pictorial information, a capability planned for future work, and the study calls for continued research to further develop and compare the analytical capabilities of various LLMs. In summary, ESGReveal is a step forward in ESG data processing, offering stakeholders a tool to better evaluate and advance corporate sustainability efforts. Its further development is promising for promoting transparency in corporate reporting and aligning with broader sustainable development aims.
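
The three-module flow described in the abstract (metadata-driven query, preprocessed report database, extraction agent) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Indicator` class, the function names, the sample report text, the keyword-overlap retrieval (standing in for a retriever over a real database), and the regular expression (standing in for the LLM agent) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Indicator:
    """Hypothetical metadata entry: indicator name, query keywords, expected unit."""
    name: str
    keywords: tuple
    unit: str


def split_into_chunks(report_text):
    """Preprocessing stand-in: one chunk per non-empty line of the report."""
    return [line.strip() for line in report_text.splitlines() if line.strip()]


def retrieve(chunks, indicator, top_k=2):
    """Retrieval stand-in: rank chunks by keyword hits and drop chunks with no hit."""
    def score(chunk):
        text = chunk.lower()
        return sum(kw in text for kw in indicator.keywords)

    ranked = sorted(chunks, key=score, reverse=True)
    return [c for c in ranked[:top_k] if score(c) > 0]


def extract(context, indicator):
    """Agent stand-in: a regex takes the first number followed by the expected unit."""
    pattern = r"([\d,]+(?:\.\d+)?)\s*" + re.escape(indicator.unit)
    for chunk in context:
        match = re.search(pattern, chunk)
        if match:
            value = float(match.group(1).replace(",", ""))
            return {"indicator": indicator.name, "value": value,
                    "unit": indicator.unit, "disclosed": True}
    # Nothing retrieved or no matching number: treat the indicator as not disclosed.
    return {"indicator": indicator.name, "value": None,
            "unit": indicator.unit, "disclosed": False}


# Made-up report text, used only to exercise the sketch.
report = """
Our total greenhouse gas emissions were 12,345.6 tCO2e in 2022.
The company employed 1,200 staff across three regions.
We continue to engage suppliers on labour standards.
"""

indicators = [
    Indicator("Total GHG emissions", ("greenhouse gas", "emissions"), "tCO2e"),
    Indicator("Total water consumption", ("water consumption",), "m3"),
]

chunks = split_into_chunks(report)
results = [extract(retrieve(chunks, ind), ind) for ind in indicators]

for row in results:
    print(row)
```

In this toy run the emissions indicator is found with a value and unit, while the water indicator is marked as not disclosed because no chunk mentions it. That mirrors, in a very reduced form, the distinction the abstract draws between data extraction and disclosure analysis.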

Paper Structure

This paper contains 25 sections, 4 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Overall structure of ESGReveal.
  • Figure 1: Disclosure Levels and Market Value Across Different Companies in Healthcare.
  • Figure 2: Structure of ESG metadata module: Entities, Extensions, and Expressions.
  • Figure 2: Key Actions and Word Frequencies Under Different Environmental Issues.
  • Figure 3: Ablation Study of ESGReveal
  • ...and 2 more figures