CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification
Nawar Turk, Eeham Khan, Leila Kosseim
TL;DR
The paper addresses the verification of corporate promises in ESG disclosures through four subtasks: promise identification, evidence assessment, clarity evaluation, and verification timing. It investigates three architectures on the English ML-Promise dataset: a Baseline ESG-BERT model, a Feature-Enhanced variant with linguistic cues, and a Combined Subtask model with attention pooling and multi-objective learning. The Combined Subtask model achieves the best private leaderboard score, 0.5268 against the Kaggle baseline of 0.5227, which highlights the value of multitask learning, contextual features, and test-time augmentation in imbalanced, small-data regimes. The work demonstrates that targeted linguistic signals and metadata enrichment can improve promise verification. It also acknowledges data scarcity as a key constraint and suggests cross-lingual extension and systematic ablations of backbone models and feature strategies as future directions.
Abstract
This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
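To make two of the ideas named in the abstract concrete, the following is a minimal, dependency-free sketch of attention-based sequence pooling and of a multi-objective loss formed as a weighted sum of per-subtask cross-entropy terms. It is an illustration under stated assumptions, not the authors' implementation: the dot-product scoring against a single query vector, the function names, the task keys, the label-set sizes, and the task weights are all hypothetical, and a real system would compute these over transformer outputs in a deep learning framework.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    token_vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    mask: Sequence[int],
) -> List[float]:
    """Pool token vectors into one vector using attention weights.

    Assumption: each token is scored by its dot product with a learned
    query vector; padded positions (mask == 0) are excluded.
    """
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in kept]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the kept token vectors, one dimension at a time.
    return [sum(w * vec[d] for w, vec in zip(weights, kept)) for d in range(dim)]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one example given raw logits and a class index."""
    return -math.log(softmax(logits)[target])


def multi_objective_loss(
    logits_by_task: Dict[str, Sequence[float]],
    targets: Dict[str, int],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-subtask losses (the weights are hypothetical)."""
    return sum(
        task_weights[task] * cross_entropy(logits_by_task[task], targets[task])
        for task in targets
    )


# Toy usage: two real tokens plus one padded position, and two subtasks.
pooled = attention_pool(
    token_vectors=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    query=[0.0, 0.0],
    mask=[1, 1, 0],
)
loss = multi_objective_loss(
    logits_by_task={"promise": [0.0, 0.0], "timing": [0.0, 0.0, 0.0, 0.0]},
    targets={"promise": 0, "timing": 2},
    task_weights={"promise": 1.0, "timing": 0.5},
)
```

With a zero query vector every unmasked token receives the same weight, so the pooled vector is the plain mean of the two real tokens; a trained query would instead shift weight toward the tokens most informative for the subtask. Summing weighted losses lets a shared encoder receive gradient signal from all subtasks at once, which is one common way to realize multi-objective learning when training data is limited.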
