Nano-ESG: Extracting Corporate Sustainability Information from News Articles
Fabian Billert, Stefan Conrad
TL;DR
Nano-ESG addresses the opacity of traditional ESG scores by constructing an open, time-stamped dataset of ESG-relevant information extracted from a large corpus of news articles about German DAX 40 companies. The authors design a multi-stage processing pipeline that filters and deduplicates articles, then uses LLMs (GPT-3.5, GPT-4o, and variants) to annotate each one with an ESG sentiment and an ESG aspect. The result is a set of 51,087 relevant articles, each with a summary and its source. A human-in-the-loop evaluation shows high agreement on summaries (95.9%), substantial agreement on sentiment (Fleiss' κ ≈ 0.818), and moderate agreement on aspects (Fleiss' κ ≈ 0.427), which supports the reliability of the data. The authors also demonstrate topic detection over time with BERTopic, linking ESG topics to real-world events (e.g., forced-labor concerns at Volkswagen), and provide an open-source release that lets researchers and practitioners monitor corporate sustainability dynamics.
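For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' κ can be computed from per-article labels given by several annotators. It is not the authors' evaluation code: the function name `fleiss_kappa`, the label names, and the toy ratings are hypothetical and serve only to illustrate the formula κ = (P̄ − P_e) / (1 − P_e), where P̄ is the mean per-item agreement and P_e is the agreement expected by chance.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings    -- list of lists; ratings[i] holds the labels given to item i
    categories -- list of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item label counts n_ij.
    counts = [Counter(r) for r in ratings]

    # P_i: share of rater pairs that agree on item i.
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # p_j: overall share of ratings that fall into category j.
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined in that case.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: two articles, three annotators each.
toy_ratings = [
    ["positive", "positive", "negative"],
    ["negative", "negative", "negative"],
]
kappa = fleiss_kappa(toy_ratings, ["positive", "negative"])
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. This puts the reported sentiment score (≈ 0.818) and aspect score (≈ 0.427) on a common scale.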
Abstract
Determining the sustainability impact of companies is a highly complex subject that has garnered increasing attention over the past few years. Today, investors largely rely on sustainability ratings from established rating providers to analyze how responsibly a company acts. However, those ratings have recently been criticized for being hard to understand and nearly impossible to reproduce. An independent way to learn about the sustainability practices of companies lies in the rich landscape of news article data. In this paper, we explore a different approach to identifying key opportunities and challenges of companies in the sustainability domain. We present a novel dataset of more than 840,000 news articles, which were gathered for major German companies between January 2023 and September 2024. By applying a mixture of Natural Language Processing techniques, we first identify relevant articles, then summarize them and extract their sustainability-related sentiment and aspect using Large Language Models (LLMs). Furthermore, we conduct an evaluation of the obtained data and determine that the LLM-produced answers are accurate. We release both datasets at https://github.com/Bailefan/Nano-ESG.
