
Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar

TL;DR

Among a family of T5-based Statement Extraction Models, the best model generates statements that are 82% similar to the ground truth (compared to a baseline of 21%).

Abstract

Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in both table structure and content. We propose Statements, a novel domain-agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet, a dataset of over 100K annotated tables. Among the family of T5-based Statement Extraction Models we investigate, our best model generates statements that are 82% similar to the ground truth (compared to a baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on the expansive information found in large collections of ESG reports.

Paper Structure (14 sections, 5 figures, 9 tables, 6 algorithms)

Figures (5)

  • Figure 1: The knowledge model of Statements represented as a tree. From the root node, individual statements emerge as branches. Associated with each individual statement node are the leaf predicate nodes.
  • Figure 2: A diagram explaining the framework introduced in this paper. We fine-tune LLMs on the task of "Statement Extraction", leading to a family of "Statement Extraction Models" (SEM). Quantitative facts are extracted from heterogeneous unstructured data (only tables in this paper) and stored as Statements.
  • Figure 3: Input and output for the task of "Statement Extraction". Top Left: Page from an ESG report containing tables. Top Right: One of the tables from the same page, prepared as markdown for model input. Bottom Left: Model output for the task of indirect statement extraction. Bottom Right: Model output for the task of direct statement extraction.
  • Figure 4: Exploratory data analysis of statements from over 2700 tables published in ESG reports in 2022. Top: We searched about 50,000 predicates using keywords (shown on the x-axis) related to environment (left), social (middle), and governance (right). The plot shows the distribution of predicates and the number of organizations from this search. Bottom: Box plot for extracted Scope 1 and Scope 2 emission values grouped by business sectors from over 300 companies across multiple years. Only sectors with more than 20 data points are included.
  • Figure 5: Example table from an ESG report with a complicated layout. To extract the information content of a single cell (highlighted in red), the content of and relationships (lines drawn in red) to many other cells (highlighted in orange) also need to be understood.
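
The tree described in the Figure 1 caption (a root, individual statement nodes as branches, and predicate nodes as leaves) could be sketched roughly as follows. This is only an illustration: the class names, the predicate fields (`subject`, `prop`, `value`, `unit`), and the sample values are assumptions made for the example and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Predicate:
    """Leaf node. Field names are hypothetical, chosen for illustration."""
    subject: str
    prop: str
    value: float
    unit: str = ""


@dataclass
class Statement:
    """Branch node: one statement grouping its leaf predicates."""
    predicates: List[Predicate] = field(default_factory=list)


@dataclass
class StatementsRoot:
    """Root node from which individual statements emerge."""
    statements: List[Statement] = field(default_factory=list)

    def leaves(self) -> Iterator[Predicate]:
        # Walk root -> statement -> predicate and yield every leaf.
        for statement in self.statements:
            yield from statement.predicates


# Placeholder values, invented purely to exercise the structure.
root = StatementsRoot(
    statements=[
        Statement([Predicate("Company A", "scope 1 emissions", 120.0, "tCO2e")]),
        Statement([Predicate("Company A", "water consumption", 3.5, "ML")]),
    ]
)
units = [p.unit for p in root.leaves()]
```

Because every leaf has the same shape regardless of the table it came from, a flat iteration such as `root.leaves()` is enough to filter or aggregate across many tables, which is the kind of homogeneity the abstract points to.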