Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models
Marianne Chuang, Gabriel Chuang, Cheryl Chuang, John Chuang
TL;DR
This work investigates the use of large language models (LLMs) to evaluate, and potentially greenwash, corporate climate disclosures. It compares two LLM-as-a-Judge (LLMJ) scoring schemes: numerical rating on a $1$ to $5$ scale, and pairwise comparison across $k=24$ peers. Pairwise comparison generally separates high- and low-performing disclosures more strongly, and it is more robust to responses greenwashed under accuracy and length constraints. The study uses CDP 2022 data (1,416 European responses to Target and Progress questions, including 147 A-List members) and analyzes prompt variants (in-context learning, indicative scales, and chain-of-thought prompting). It also demonstrates that LLMs can greenwash responses when unconstrained, but that the LLMJ framework, particularly pairwise scoring, remains comparatively robust when content accuracy is enforced. Limitations include the geographic and temporal scope of the data and reliance on a single LLM. Future work could broaden the datasets and the range of models, potentially enabling scalable benchmarking of climate-disclosure quality.
Abstract
We study the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. First, we investigate the use of the LLM-as-a-Judge (LLMJ) methodology for scoring company-submitted reports on emissions reduction targets and progress. Second, we probe the behavior of an LLM when it is prompted to greenwash a response subject to accuracy and length constraints. Finally, we test the robustness of the LLMJ methodology against responses that may be greenwashed using an LLM. We find that two LLMJ scoring systems, numerical rating and pairwise comparison, are effective in distinguishing high-performing companies from others, with the pairwise comparison system showing greater robustness against LLM-greenwashed responses.
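As a rough illustration of the pairwise comparison scoring mentioned above, the following minimal sketch scores each response by its win rate against up to $k$ sampled peers. The aggregation rule (win rate), the function names `pairwise_scores` and `toy_judge`, and the sample responses are all assumptions made for illustration, not the authors' implementation. In practice the judge would wrap an LLM call rather than the placeholder heuristic used here.

```python
import random
from typing import Callable, Dict

# Hypothetical judge signature: returns True if response `a` is judged
# better than response `b`. A real judge would wrap an LLM call.
Judge = Callable[[str, str], bool]


def pairwise_scores(
    responses: Dict[str, str], judge: Judge, k: int = 24, seed: int = 0
) -> Dict[str, float]:
    """Score each response by its win rate against up to k sampled peers.

    Assumption: win rate is used as the aggregate; the source only states
    that responses are compared pairwise across k = 24 peers.
    """
    rng = random.Random(seed)
    ids = sorted(responses)
    scores: Dict[str, float] = {}
    for cid in ids:
        peers = [p for p in ids if p != cid]
        sampled = rng.sample(peers, min(k, len(peers)))
        wins = sum(judge(responses[cid], responses[p]) for p in sampled)
        scores[cid] = wins / len(sampled) if sampled else 0.0
    return scores


def _digit_count(text: str) -> int:
    """Count the digit characters in a string."""
    return sum(ch.isdigit() for ch in text)


def toy_judge(a: str, b: str) -> bool:
    """Placeholder only: prefers the response containing more digits."""
    return _digit_count(a) > _digit_count(b)


# Made-up responses, for illustration only.
demo = {
    "A": "Cut Scope 1 emissions 42% since 2019",
    "B": "We care about the planet",
    "C": "Target: 30% by 2030",
}
scores = pairwise_scores(demo, toy_judge)
```

With only three responses, each one is compared against both of its peers, so the sampling step has no effect. With a larger pool, the fixed `seed` keeps the sampled peer sets reproducible.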
