Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models

Marianne Chuang, Gabriel Chuang, Cheryl Chuang, John Chuang

TL;DR

This work investigates the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. It compares two LLM-as-a-Judge (LLMJ) scoring schemes, numerical rating on a $1$ to $5$ scale and pairwise comparison across $k=24$ peers, and finds that pairwise comparison generally separates high- and low-performing disclosures more strongly and is robust to greenwashing under accuracy and length constraints. The study uses CDP 2022 data (1,416 European responses to the Target and Progress questions, including 147 A-List members) and analyzes prompt variants (in-context learning, indicative scales, and chain-of-thought prompting). It also shows that an unconstrained LLM can greenwash responses, but that the LLMJ framework, particularly pairwise scoring, remains comparatively robust when content accuracy is enforced. Limitations include the geographic and temporal scope of the data and reliance on a single LLM; future work could broaden the datasets and models used, potentially enabling scalable benchmarking of climate-disclosure quality.

Abstract

We study the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. First, we investigate the use of the LLM-as-a-Judge (LLMJ) methodology for scoring company-submitted reports on emissions reduction targets and progress. Second, we probe the behavior of an LLM when it is prompted to greenwash a response subject to accuracy and length constraints. Finally, we test the robustness of the LLMJ methodology against responses that may be greenwashed using an LLM. We find that two LLMJ scoring systems, numerical rating and pairwise comparison, are effective in distinguishing high-performing companies from others, with the pairwise comparison system showing greater robustness against LLM-greenwashed responses.
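
To make the pairwise comparison scheme concrete, the following is a minimal sketch of how a response might be scored against $k=24$ sampled peers. It is not the authors' implementation: the summary here does not say how individual comparisons are aggregated, so the win-fraction score is an assumption, and `pairwise_score`, `toy_judge`, and the `seed` parameter are hypothetical names introduced for illustration. The `toy_judge` function stands in for an LLM call and simply prefers the response containing more digits.

```python
import random
from typing import Callable, Sequence


def pairwise_score(
    target: str,
    peers: Sequence[str],
    judge: Callable[[str, str], bool],
    k: int = 24,
    seed: int = 0,
) -> float:
    """Return the fraction of up to k sampled peers that `judge` ranks below `target`.

    Assumption: the score is a simple win rate; the source does not specify
    the aggregation rule.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(peers), min(k, len(peers)))
    if not sample:
        raise ValueError("need at least one peer response")
    # judge(a, b) returns True when response `a` is preferred over `b`.
    wins = sum(1 for peer in sample if judge(target, peer))
    return wins / len(sample)


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM judge: prefers the text with more digits."""
    def digits(text: str) -> int:
        return sum(ch.isdigit() for ch in text)

    return digits(a) > digits(b)


if __name__ == "__main__":
    peers = ["We intend to reduce emissions over time."] * 30
    print(pairwise_score("Cut Scope 1 emissions 42% by 2030.", peers, toy_judge))
```

In practice the judge would be an LLM prompted with both responses, and a real pipeline would likely also need to control for presentation order, which this sketch ignores.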

Paper Structure

This paper contains 21 sections, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Baseline LLM-as-a-Judge prompts for numerical rating (left) and pairwise comparison (right).
  • Figure 2: Distribution of scores for A-List and non-A-List companies from numerical rating (left) and pairwise comparison (right).
  • Figure 3: The two scoring systems produce consistent results: there is high correlation between the two scores ($r^2 = 0.70$).
  • Figure 4: The prompt used to generate greenwashed responses. We generated responses with no constraints, accuracy constraints (red), and both accuracy (red) and length (blue) constraints.
  • Figure 5: An example CDP response, along with LLM-greenwashed variations under three sets of constraints. Changes are loosely labeled by type.
  • ...and 7 more figures