EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

Jing Han Sun; Ali Emami

TL;DR

This paper introduces EvoGrad, an open-source, human-in-the-loop platform to dynamically expand Winograd Schema Challenge data and probe common-sense reasoning. By combining human perturbations with scalable ChatGPT generation and WordNet augmentation, the authors grow the dataset from 182 to 3,691 instances and introduce an error depth metric to assess model stability under progressive perturbations. Empirical results show that even strong LLMs like GPT-3.5 struggle under dynamic perturbations (roughly 65-67% accuracy with average error depth around 7), while humans maintain high accuracy (~92.8-95.2%) without perturbation errors, underscoring persistent gaps and the value of EvoGrad. The work highlights EvoGrad's potential to diagnose and improve robustness in language models and outlines a path toward broader community involvement and application to additional NLP tasks.

Abstract

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.

Paper Structure (25 sections, 2 equations, 3 figures, 5 tables)

Introduction
Related Work
WSC-based Datasets
Dynamic Datasets
Data Augmentation Methods in NLP
Large Language Models in Data Augmentation and Annotation
EvoGrad
Dataset Evolution by Perturbation
Scaling with ChatGPT
Scaling with Wordnet
The Dataset
The Platform
Error Depth
Human Performance
Experiments and Results
...and 10 more sections

Figures (3)

Figure 1: Interface of EvoGrad at https://evograd.com
Figure 2: Evolution figure of the sentence "Kevin yelled at Jim because he was so upset." up to depth level 2.
Figure 3: Distribution of the top three perturbations for models trained on Winogrande + Evograd-L. From left to right: BERT, RoBERTa, and ALBERT. The segments represent the relative frequency of each perturbation type: '--NN' (noun removal), '+NN' (noun addition), and either '--JJ' (adjective removal) or '--IN' (preposition removal).
