Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction

Xiaoran Cai; Wang Yang; Xiyu Ren; Chekun Law; Rohit Sharma; Peng Qi

TL;DR

This paper proposes a universal human-AI collaboration framework for generating trustworthy benchmark datasets to evaluate sustainability rating methodologies, and calls on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.

Abstract

Sustainability (or ESG) rating agencies use company disclosures and external data to produce scores or ratings that assess a company's environmental, social, and governance performance. However, ratings of the same company vary widely across agencies, which limits their comparability, credibility, and relevance to decision-making. To harmonize rating results, we propose a universal human-AI collaboration framework for generating trustworthy benchmark datasets with which to evaluate sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating & Integrity Data Equation), which provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a procedural framework for discrepancy analysis that surfaces insights for potential adjustments. Together, they enable scalable and comparable assessment of sustainability rating methodologies. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.

Paper Structure (22 sections, 69 equations, 9 figures, 3 tables)

Introduction
Key Challenges in AI-Based Sustainability Rating Benchmarking
STRIDE: Sustainability Trust Rating and Integrity Data Equation
Human-Machine Trust Equation
Credibility
Reliability
Intimacy
Self-served Purpose
SR-Delta: Sustainability Rating Delta for Methodology Improvement
Alternative Views
Conclusion
Limitations
Equation
A Case Study - Luxshare Precision Industry Co. Ltd ("Luxshare")
Credibility - Raw Input Data Selection
...and 7 more sections

Figures (9)

Figure 1: STRIDE framework for trustworthy sustainability rating benchmark datasets. Trust is modeled as a function of credibility, reliability, and human–AI intimacy (positive contributors), and self-serving purpose (negative contributor).
Figure 2: STRIDE trust formulation and component decomposition.
Figure 3: Overview of the STRIDE-guided discrepancy analysis framework. Rating outcomes generated using an existing sustainability rating methodology (A) are compared against ratings produced using STRIDE-guided benchmark data to identify systematic divergences.
Figure 4: The Fortune Global 500 comprises firms with diversity across geographic regions and industry sectors.
Figure 5: In our case study, metrics across all levels are consolidated and de-duplicated using LLMs. We recommend incorporating humans in the loop as suggested in the dashed box.
...and 4 more figures
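
The Figure 1 caption describes trust as a function of three positive contributors (credibility, reliability, human-AI intimacy) and one negative contributor (self-serving purpose). This page does not reproduce the paper's actual formula, so the following is only a minimal sketch under an assumed ratio form, borrowed from the commonly cited consulting "trust equation" (sum of positive factors divided by self-orientation). The class name, function name, field names, and score scales are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustInputs:
    """Hypothetical component scores; names and scales are illustrative assumptions."""
    credibility: float           # positive contributor
    reliability: float           # positive contributor
    intimacy: float              # positive contributor (human-AI intimacy)
    self_serving_purpose: float  # negative contributor; must be > 0 in this sketch


def trust_score(x: TrustInputs) -> float:
    """ASSUMED form: (credibility + reliability + intimacy) / self_serving_purpose.

    This is not confirmed to be the STRIDE formula; it only mirrors the
    positive/negative roles stated in the Figure 1 caption.
    """
    if x.self_serving_purpose <= 0:
        raise ValueError("self_serving_purpose must be positive in this sketch")
    positives = x.credibility + x.reliability + x.intimacy
    return positives / x.self_serving_purpose


if __name__ == "__main__":
    low_bias = TrustInputs(credibility=8, reliability=7, intimacy=6, self_serving_purpose=3)
    high_bias = TrustInputs(credibility=8, reliability=7, intimacy=6, self_serving_purpose=6)
    print(trust_score(low_bias))   # (8 + 7 + 6) / 3
    print(trust_score(high_bias))  # same positives, larger negative contributor
```

Under this assumed form, holding the positive contributors fixed and raising the self-serving purpose score lowers the trust score, which matches the direction of influence the caption describes.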
