Nano-ESG: Extracting Corporate Sustainability Information from News Articles

Fabian Billert, Stefan Conrad

TL;DR

Nano-ESG addresses the opacity of traditional ESG scores by constructing an open, time-stamped dataset of ESG-relevant information extracted from a large corpus of news articles about German DAX 40 companies. The authors design a multi-stage processing pipeline that filters and deduplicates the articles, then annotates them with ESG sentiment and ESG aspect using LLMs (GPT-3.5, GPT-4o, and variants). The result is a set of 51,087 relevant articles, each with a summary and its source. A human-in-the-loop evaluation shows high agreement on summaries (95.9%), substantial agreement on sentiment (Fleiss' κ ≈ 0.818), and moderate agreement on aspects (Fleiss' κ ≈ 0.427), which supports the reliability of the data. The authors also demonstrate topic detection over time with BERTopic, linking ESG topics to real-world events (e.g., forced-labor concerns at Volkswagen), and provide an open-source release so that researchers and practitioners can monitor corporate sustainability dynamics.
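For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' κ can be computed from categorical labels. It is not the authors' evaluation code: the function name `fleiss_kappa`, the label strings, and the toy ratings are all hypothetical, and it assumes every item was labeled by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list of category labels,
    one label per rater. Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical, not from the Nano-ESG dataset): 3 raters, 4 articles.
toy_ratings = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "negative"],
    ["positive", "positive", "negative"],
    ["neutral", "neutral", "neutral"],
]
kappa = fleiss_kappa(toy_ratings)
```

Values near 1 indicate that raters agree far more often than chance would predict, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement.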

Abstract

Determining the sustainability impact of companies is a highly complex subject which has garnered more and more attention over the past few years. Today, investors largely rely on sustainability-ratings from established rating-providers in order to analyze how responsibly a company acts. However, those ratings have recently been criticized for being hard to understand and nearly impossible to reproduce. An independent way to find out about the sustainability practices of companies lies in the rich landscape of news article data. In this paper, we explore a different approach to identify key opportunities and challenges of companies in the sustainability domain. We present a novel dataset of more than 840,000 news articles which were gathered for major German companies between January 2023 and September 2024. By applying a mixture of Natural Language Processing techniques, we first identify relevant articles, before summarizing them and extracting their sustainability-related sentiment and aspect using Large Language Models (LLMs). Furthermore, we conduct an evaluation of the obtained data and determine that the LLM-produced answers are accurate. We release both datasets at https://github.com/Bailefan/Nano-ESG.

Paper Structure

This paper contains 21 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Processing-pipeline (left) used for the creation of the dataset and how much each step reduces the amount of data relative to the starting point (middle) and relative to the previous step (right).
  • Figure 2: Number of articles with different ESG-aspects per company. Top: total number. Bottom: Ratio of each aspect per company.
  • Figure 3: Left: Weekly number of articles in total and for each ESG-aspect. Right: 30-day moving average of the sentiment per ESG-aspect.
  • Figure 4: Number of positive and negative articles regarding the "Forced Labor" topic of Volkswagen over time for the different detected aspects.
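
The 30-day moving average mentioned in the caption of Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' plotting code: the function name `moving_average_sentiment`, the numeric sentiment encoding (+1 positive, 0 neutral, -1 negative), and the sample records are hypothetical, and the window is taken to be the trailing 30 calendar days including the current day.

```python
from datetime import date, timedelta


def moving_average_sentiment(records, window_days=30):
    """Trailing moving average of sentiment scores.

    records: iterable of (date, score) pairs for a single ESG aspect.
    Returns a dict mapping each date that has at least one article to the
    mean score of all articles in the trailing window ending on that date.
    """
    records = sorted(records)
    days = sorted({day for day, _ in records})
    averages = {}
    for day in days:
        window_start = day - timedelta(days=window_days - 1)
        scores = [s for d, s in records if window_start <= d <= day]
        averages[day] = sum(scores) / len(scores)
    return averages


# Hypothetical records for one aspect; scores use the assumed encoding
# +1 = positive, 0 = neutral, -1 = negative.
sample = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 15), -1),
    (date(2024, 2, 20), 1),
]
smoothed = moving_average_sentiment(sample)
```

To reproduce a per-aspect plot, the same function would be called once per aspect on the records belonging to that aspect.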