EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries
Jing Han Sun, Ali Emami
TL;DR
This paper introduces EvoGrad, an open-source, human-in-the-loop platform to dynamically expand Winograd Schema Challenge data and probe common-sense reasoning. By combining human perturbations with scalable ChatGPT generation and WordNet augmentation, the authors grow the dataset from 182 to 3,691 instances and introduce an error depth metric to assess model stability under progressive perturbations. Empirical results show that even strong LLMs like GPT-3.5 struggle under dynamic perturbations (roughly 65-67% accuracy with average error depth around 7), while humans maintain high accuracy (~92.8-95.2%) without perturbation errors, underscoring persistent gaps and the value of EvoGrad. The work highlights EvoGrad's potential to diagnose and improve robustness in language models and outlines a path toward broader community involvement and application to additional NLP tasks.
Abstract
While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, which assesses model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best-performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.
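The abstract does not define error depth, so the following is only an illustrative sketch under a stated assumption: error depth is taken to be the number of successive perturbations applied to an original instance before the model gives its first wrong answer. The names `error_depth`, `average_error_depth`, and `predict`, and the representation of a perturbation chain as a list of (sentence, gold answer) pairs, are hypothetical and not taken from the EvoGrad codebase.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical representation: one instance is (sentence, gold answer).
# A chain starts with the original instance (depth 0); each later element
# is one further perturbation of the element before it.
Instance = Tuple[str, str]


def error_depth(chain: Sequence[Instance],
                predict: Callable[[str], str]) -> Optional[int]:
    """Return the depth of the first wrong prediction in the chain.

    ASSUMPTION: "error depth" means the number of perturbations applied
    before the first error. Returns None if the model never errs.
    """
    for depth, (sentence, gold) in enumerate(chain):
        if predict(sentence) != gold:
            return depth
    return None


def average_error_depth(chains: Sequence[Sequence[Instance]],
                        predict: Callable[[str], str]) -> Optional[float]:
    """Average error depth over the chains in which the model errs at all."""
    depths: List[int] = []
    for chain in chains:
        depth = error_depth(chain, predict)
        if depth is not None:
            depths.append(depth)
    return sum(depths) / len(depths) if depths else None
```

Under this reading, a chain with no errors contributes no depth at all, which would match the abstract's description of human performance as being "without perturbation errors". Whether the authors average over erring chains only is an assumption of this sketch.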
