Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task
Allyson Ettinger, Sudha Rao, Hal Daumé, Emily M. Bender
TL;DR
The paper summarizes a workshop and its associated shared task, Build It Break It, The Language Edition, which was designed to test whether NLP systems generalize linguistically beyond the distributions of their training data. It documents the task design (building, breaking, scoring), data construction (sentiment and QA-SRL), the participating systems and adversarial breakers, and scoring results that show where system robustness breaks down. Key contributions include a public dataset of minimal-pair test cases and a framework for adversarial evaluation that engages both the NLP and linguistics communities. The paper also draws lessons for future iterations, namely improving participation, labeling quality, and the clarity of minimal-pair definitions, with the aim of advancing robust, linguistically aware NLP systems.
Abstract
This paper presents a summary of the first Workshop on Building Linguistically Generalizable Natural Language Processing Systems, and the associated Build It Break It, The Language Edition shared task. The goal of this workshop was to bring together researchers in NLP and linguistics with a shared task aimed at testing the generalizability of NLP systems beyond the distributions of their training data. We describe the motivation, setup, and participation of the shared task, discuss some highlighted results, and present lessons learned.
