Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models
Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas Pickard, Dylan Phelps, Aline Villavicencio
TL;DR
This paper examines how language models process idioms and introduces the Noun Compound Synonym Substitution in Books (NCSSB) datasets to study the roles of data quality, data quantity, and contextual information in idiomaticity detection. It evaluates a range of modeling configurations, from a baseline mBERT setup to data augmentation and external-knowledge enhancements (ParaCOMET and glosses), using SemEval 2022 Task 2 Subtask B as the evaluation benchmark. The results indicate that data quality is the stronger factor for context-enhanced models, while data quantity has more impact on models without explicit knowledge integration. External knowledge generally improves idiom representation, particularly when the context is targeted and high-quality data is available. For idiom-aware NLP systems, the findings suggest that combining high-quality data with access to external knowledge yields the best outcomes.
Abstract
Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, their effectiveness depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books (NCSSB) datasets, which are created by substituting synonyms for potentially idiomatic English noun compounds in public-domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role in models without context inclusion strategies.
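As a rough illustration of the synonym-substitution idea described in the abstract, the sketch below swaps a noun compound for each of its synonyms in a sentence. It is not the authors' pipeline: the function name `substitute_compound`, the example sentence, and the synonym list are all hypothetical, and the actual NCSSB construction may filter, label, or align sentences differently.

```python
import re


def substitute_compound(sentence, compound, synonyms):
    """Return one variant of `sentence` per synonym, with `compound` replaced.

    Returns an empty list when the compound does not occur in the sentence.
    Hypothetical helper for illustration only.
    """
    # Match the compound as a whole phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(compound) + r"\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        return []
    # A function replacement avoids backslash-escape processing of the synonym.
    return [pattern.sub(lambda _match, s=syn: s, sentence) for syn in synonyms]


# Illustrative inputs (assumed, not taken from the NCSSB data).
sentence = "The concert was his swan song."
variants = substitute_compound(sentence, "swan song", ["final performance"])
print(variants)
```

Each variant could then be paired with its original sentence to form a training example; how such pairs are labeled and how many are kept is where the quality versus quantity trade-off discussed above would come into play.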
