Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models

Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas Pickard, Dylan Phelps, Aline Villavicencio

TL;DR

The paper addresses how language models process idioms, proposing NCSSB datasets to systematically study the roles of data quality, data quantity, and contextual information in idiomaticity detection. It evaluates a range of modeling configurations, from a baseline mBERT setup to data augmentation and external-knowledge enhancements (ParaCOMET and glosses), using SemEval 2022 Task 2 Subtask B as the evaluation benchmark. Key findings show that data quality significantly boosts context-enhanced performance, while data quantity is more impactful for models without explicit knowledge integration; external knowledge generally improves idiom representation, particularly when context is targeted and high-quality data is available. The work highlights practical implications for building idiom-aware NLP systems, suggesting that a balance of high-quality data and external knowledge access yields the best outcomes in real-world settings.

Abstract

Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books (NCSSB) datasets, which are created by substituting synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role in models without context inclusion strategies.
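To make the synonym-substitution idea more concrete, the following is a minimal sketch of one possible reading of it, not the authors' pipeline. It assumes that a book sentence containing a synonym of a noun compound is paired with a copy in which the synonym is replaced by the compound. The synonym table, the function names `substitute` and `build_pairs`, and the example sentences are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical synonym table (noun compound -> synonyms); illustrative only,
# not taken from the NCSSB datasets.
SYNONYMS = {
    "red tape": ["bureaucracy"],
    "night owl": ["late sleeper"],
}


def substitute(sentence, compound, synonym):
    """Replace whole-word occurrences of `synonym` with `compound`.

    Returns None when the synonym does not occur in the sentence.
    """
    pattern = re.compile(r"\b" + re.escape(synonym) + r"\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        return None
    # A function replacement avoids interpreting backslashes in `compound`.
    return pattern.sub(lambda _match: compound, sentence)


def build_pairs(sentences, synonyms=SYNONYMS):
    """Pair each matching sentence with its compound-substituted version."""
    pairs = []
    for sentence in sentences:
        for compound, candidates in synonyms.items():
            for synonym in candidates:
                replaced = substitute(sentence, compound, synonym)
                if replaced is not None:
                    pairs.append(
                        {
                            "compound": compound,
                            "original": sentence,
                            "substituted": replaced,
                        }
                    )
    return pairs


if __name__ == "__main__":
    examples = ["The bureaucracy delayed the permit.", "It rained all day."]
    for pair in build_pairs(examples):
        print(pair["original"], "->", pair["substituted"])
```

A naive substitution like this ignores word sense and grammatical agreement, which is one way such automatically generated data can trade quality for quantity.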

Paper Structure (28 sections, 2 equations, 10 figures, 2 tables)

This paper contains 28 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: SemEval Task 2 Subtask B sentence structures used in Bronze dataset generation.
  • Figure 2: Performance with differing number of glosses, $n$, on Silver 1 dataset.
  • Figure 3: Performance of Gold and Silver 10 datasets against randomly-sampled Bronze datasets of the same size.
  • Figure 4: Impact of data augmentations applied to the Gold and Gold + SemEval datasets.
  • Figure 5: Performance of sentence and paragraph context baseline models as dataset size increases.
  • ...and 5 more figures