Glitter or Gold? Deriving Structured Insights from Sustainability Reports via Large Language Models

Marco Bronzini, Carlo Nicolini, Bruno Lepri, Andrea Passerini, Jacopo Staiano

TL;DR

This paper tackles the problem of extracting structured, actionable ESG-related insights from unstructured sustainability reports. It proposes a pipeline that combines instruction-tuned Large Language Models with In-Context Learning and Retrieval-Augmented Generation to produce ESG category–predicate–object triples, which are organized into a Knowledge Graph and three bipartite graphs for analysis. Through statistical analyses and SHAP-based interpretability, the authors show that ESG disclosures span more than 500 topics, that disclosures cluster by region and sector, and that ESG disclosures influence ESG ratings more strongly than other company attributes. The approach demonstrates the value of integrating generative NLP, graph representations, and interpretable analytics to enhance the transparency and comparability of CSR disclosures and ratings.

Abstract

Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of investors' increasing attention to Environmental, Social, and Governance (ESG) issues. Publicly released information on sustainability practices is often disclosed in diverse, unstructured, and multi-modal documentation. This poses a challenge in efficiently gathering and aligning the data into a unified framework to derive insights related to Corporate Social Responsibility (CSR). Thus, using Information Extraction (IE) methods becomes an intuitive choice for delivering insightful and actionable data to stakeholders. In this study, we employ Large Language Models (LLMs), In-Context Learning, and the Retrieval-Augmented Generation (RAG) paradigm to extract structured insights related to ESG aspects from companies' sustainability reports. We then leverage graph-based representations to conduct statistical analyses concerning the extracted insights. These analyses revealed that ESG criteria cover more than 500 topics, often beyond those considered in existing categorizations, and are addressed by companies through a variety of initiatives. Moreover, disclosure similarities emerged among companies from the same region or sector, validating ongoing hypotheses in the ESG literature. Lastly, by incorporating additional company attributes into our analyses, we investigated which factors most affect companies' ESG ratings, showing that ESG disclosure affects the obtained ratings more than other financial or company data.
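
As a rough illustration of the graph-based analysis described in the abstract, the sketch below groups extracted triples into a company-to-category bipartite mapping and compares companies by the overlap of the ESG categories they disclose. The triples are invented for illustration, and Jaccard overlap is an assumption chosen for simplicity; the summary above does not state which similarity measure the authors use.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (company, category, predicate, object) triples.
# These are illustrative only and are not data from the paper.
TRIPLES = [
    ("CompanyA", "emissions", "reduce", "scope 1 emissions"),
    ("CompanyA", "diversity", "increase", "women in leadership"),
    ("CompanyB", "emissions", "offset", "carbon footprint"),
    ("CompanyB", "water", "recycle", "process water"),
    ("CompanyC", "diversity", "promote", "inclusive hiring"),
]


def company_category_graph(triples):
    """Build a bipartite mapping: company -> set of disclosed ESG categories."""
    graph = defaultdict(set)
    for company, category, _predicate, _obj in triples:
        graph[company].add(category)
    return dict(graph)


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed measure, not the authors')."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(graph):
    """Similarity for every unordered pair of companies, keyed by sorted names."""
    return {
        (c1, c2): jaccard(graph[c1], graph[c2])
        for c1, c2 in combinations(sorted(graph), 2)
    }


if __name__ == "__main__":
    graph = company_category_graph(TRIPLES)
    for pair, score in pairwise_similarity(graph).items():
        print(pair, round(score, 3))
```

Companies that share no disclosed category score 0.0, while companies with identical category sets score 1.0, so regional or sectoral clusters would show up as groups with high pairwise scores.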

Paper Structure (58 sections, 3 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Our proposed approach and its components. A collection of textual documents is taken as input and preprocessed using different NLP methods; semantically structured insights are then extracted by the retrieval-augmented triple generator. Bipartite graphs are created after performing a semantic-based triple clustering.
  • Figure 2: Two labelled examples included within our model instruction. The input sentences were created to cover different syntactical structures, and do not represent actual information.
  • Figure 3: The instruction template used to prompt the generative LLM. The instruction template is created and compiled by the Kor library to prompt an LLM to extract structured data using In-Context Learning. DATASCHEMA is a placeholder for the output data schema, EXAMPLES is a placeholder for the labelled examples in the format of input-output pairs, and INPUT represents the ESG-related sentence from which structured data needs to be extracted.
  • Figure 4: An example of a portion of the Knowledge Graph generated using our methodology. It portrays the ESG-oriented triples generated using our approach. Blue nodes represent company nodes which are connected to the ESG categories (green nodes) disclosed in the companies' sustainability reports. Category nodes (cat) are connected via a labelled edge (pred) to the predicate object (obj, grey nodes).
  • Figure 5: Descriptive statistics of the numerical features. They are used as part of the predictors of the Ordinary Least Squares (OLS) model. The features labelled with the starting word "category" reflect the percentage disclosure of an ESG topic. To enhance readability, the figure exhibits only three category features for explanatory purposes. The statistics are presented before standardisation.
  • ...and 5 more figures
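
The instruction template described for Figure 3 can be approximated with plain string formatting. The sketch below is not the template compiled by the Kor library, and it does not call Kor's API; the template wording, the schema fields, and the two labelled examples are all hypothetical. Only the three placeholder names (DATASCHEMA, EXAMPLES, INPUT) come from the caption.

```python
import json

# Hypothetical template text; only the placeholder names come from the caption.
TEMPLATE = (
    "Extract structured information from the sentence below.\n"
    "Output schema:\n{DATASCHEMA}\n\n"
    "Examples:\n{EXAMPLES}\n\n"
    "Input: {INPUT}\nOutput:"
)

# Assumed output schema, mirroring the category-predicate-object triples.
SCHEMA = {"category": "string", "predicate": "string", "object": "string"}

# Two invented input-output pairs, standing in for the labelled examples.
EXAMPLES = [
    (
        "The firm cut its water use.",
        {"category": "water", "predicate": "reduce", "object": "water use"},
    ),
    (
        "The firm trained all staff on ethics.",
        {"category": "ethics", "predicate": "train", "object": "all staff"},
    ),
]


def build_prompt(sentence, schema=SCHEMA, examples=EXAMPLES):
    """Fill the three placeholders to produce an in-context-learning prompt."""
    examples_text = "\n".join(
        f"Input: {text}\nOutput: {json.dumps(output)}" for text, output in examples
    )
    return TEMPLATE.format(
        DATASCHEMA=json.dumps(schema, indent=2),
        EXAMPLES=examples_text,
        INPUT=sentence,
    )


if __name__ == "__main__":
    print(build_prompt("We planted trees across our sites."))
```

Ending the prompt with an open "Output:" line invites the model to continue in the same JSON shape as the labelled examples, which is the usual way In-Context Learning is steered toward a fixed schema.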