Corporate Greenwashing Detection in Text -- a Survey
Tom Calamai, Oana Balalau, Théo Le Guenedal, Fabian M. Suchanek
TL;DR
The paper surveys NLP methods for detecting greenwashing in climate-related corporate text. It argues that no gold-standard greenwashing dataset currently exists, so researchers rely on intermediate tasks to approximate detection. It groups prior work into pretraining of domain-specific models, climate-topic detection, and thematic analysis (TCFD/ESG), then turns to more in-depth tasks: climate-risk classification, green claim detection, stance detection, question answering, detection of deceptive techniques, and environmental performance prediction. Across sections, the authors report that transformer-based models dominate, with gains that are often strong but sometimes limited, and that many results depend heavily on dataset definitions and labeling quality. They identify major open challenges: evaluation methodology, model robustness to noise and adversarial inputs, data access and reproducibility, and the need to link texts to regulatory standards so that judgments are grounded. Overall, the survey maps a multi-layered NLP pipeline for greenwashing detection, highlights the gaps between theory and practice, and calls for real-world, regulator-aligned datasets to enable reliable, scalable detection and accountability in climate communications.
Abstract
Greenwashing is an effort to mislead the public about the environmental impact of an entity, such as a state or company. We provide a comprehensive survey of the scientific literature addressing natural language processing methods to identify potentially misleading climate-related corporate communications, indicative of greenwashing. We break the detection of greenwashing into intermediate tasks, and review the state-of-the-art approaches for each of them. We discuss datasets, methods, and results, as well as limitations and open challenges. We also provide an overview of how far the field has come as a whole, and point out future research directions.
