Finding Challenging Metaphors that Confuse Pretrained Language Models
Yucheng Li, Frank Guerin, Chenghua Lin
TL;DR
The paper investigates which metaphors truly challenge pretrained language models and finds that many VUA metaphors are easy for modern NLP systems. It introduces a model-specific framework for detecting hard metaphors, which uses Word Sense Disambiguation and contrastive learning to produce Sense-Only Representations and an overlap ratio $\varphi$ to flag hard cases. The authors build a RoBERTa-tailored Hard Metaphor Dataset (HMD) with 21k examples across 82 words and 110 senses. Downstream evaluation across MT, NLI, QA, and metaphor identification shows substantial degradation on hard metaphors, highlighting the role of context over novelty. The work presents hard metaphors as a meaningful benchmark for metaphor processing, promotes targeted, model-aware testing, and suggests directions toward long-tail and symbolic approaches to improve downstream robustness.
Abstract
Metaphors are considered to pose challenges for a wide spectrum of NLP tasks, which has given rise to the area of computational metaphor processing. However, it remains unclear what types of metaphors challenge current state-of-the-art models. In this paper, we test various NLP models on the VUA metaphor dataset and quantify the extent to which metaphors affect models' performance on various downstream tasks. Our analysis reveals that VUA includes a large number of metaphors that pose little difficulty to downstream tasks. We would like to shift researchers' attention away from these metaphors and toward challenging ones. To this end, we propose an automatic pipeline that identifies the metaphors that challenge a particular model. Our analysis demonstrates that the hard metaphors we detect contrast significantly with VUA and reduce machine translation accuracy by 16\%, QA performance by 4\%, NLI by 7\%, and metaphor identification recall by over 14\% for various popular NLP systems.
