Finding Challenging Metaphors that Confuse Pretrained Language Models

Yucheng Li, Frank Guerin, Chenghua Lin

TL;DR

The paper investigates which metaphors truly challenge pretrained language models, revealing that many VUA metaphors are easy for modern NLP systems. It introduces a model-specific hard metaphor detection framework that uses Word Sense Disambiguation and contrastive learning to produce Sense-Only Representations, paired with an overlap ratio $\varphi$ to flag hard cases. A RoBERTa-tailored Hard Metaphor Dataset (HMD) with 21k examples across 82 words and 110 senses is built, and downstream evaluation across MT, NLI, QA, and metaphor identification shows substantial degradation on hard metaphors, highlighting the role of context over novelty. The work presents hard metaphors as a meaningful benchmark for metaphor processing, promotes targeted, model-aware testing, and suggests directions toward long-tail and symbolic approaches to improve downstream robustness.

Abstract

Metaphors are considered to pose challenges for a wide spectrum of NLP tasks. This gives rise to the area of computational metaphor processing. However, it remains unclear what types of metaphors challenge current state-of-the-art models. In this paper, we test various NLP models on the VUA metaphor dataset and quantify to what extent metaphors affect models' performance on various downstream tasks. Analysis reveals that VUA includes a large number of metaphors that pose little difficulty to downstream tasks. We would like to shift the attention of researchers away from these metaphors to instead focus on challenging metaphors. To identify hard metaphors, we propose an automatic pipeline that identifies metaphors that challenge a particular model. Our analysis demonstrates that our detected hard metaphors contrast significantly with VUA and reduce the accuracy of machine translation by 16%, QA performance by 4%, NLI by 7%, and metaphor identification recall by over 14% for various popular NLP systems.
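The kind of comparison the abstract describes, performance on literal versus metaphoric samples, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Example` record, its field names, and the toy data are all hypothetical, and plain accuracy stands in for whatever task-specific metric (translation accuracy, QA score, recall) a real evaluation would use.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Example:
    # Hypothetical record; field names are assumptions, not taken from the paper.
    is_metaphor: bool  # gold label: is the target word used metaphorically?
    correct: bool      # did the downstream model handle this example correctly?


def accuracy(examples: Iterable[Example]) -> float:
    """Fraction of examples the model got right."""
    items: List[Example] = list(examples)
    if not items:
        raise ValueError("cannot compute accuracy on an empty subset")
    return sum(e.correct for e in items) / len(items)


def metaphor_gap(examples: Iterable[Example]) -> float:
    """Accuracy on literal examples minus accuracy on metaphoric ones.

    A positive value means the model does worse on metaphors.
    """
    items = list(examples)
    literal = [e for e in items if not e.is_metaphor]
    metaphoric = [e for e in items if e.is_metaphor]
    return accuracy(literal) - accuracy(metaphoric)


# Toy data: 9 of 10 literal examples correct, 3 of 4 metaphoric examples correct.
toy = (
    [Example(False, True)] * 9
    + [Example(False, False)]
    + [Example(True, True)] * 3
    + [Example(True, False)]
)
gap = metaphor_gap(toy)
print(f"literal-vs-metaphor accuracy gap: {gap:.2f}")
```

On a dataset where most metaphors are easy for the model, this gap stays close to zero, which is the situation the abstract reports for much of VUA.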

Paper Structure (23 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between metaphoric and literal samples from the VUA dataset on various NLP tasks.
  • Figure 2: POS breakdown analysis on metaphors from the VUA test set.
  • Figure 3: PCA visualization of RoBERTa's original embedding (a) and Sense Only Representations (SORs) (b) of the word act in different passages. The legend shows the word sense gloss from WordNet. Two examples are given at the bottom right, one (S2) distinguished successfully, the other (S1) in a wrong sense cluster. Examples are from the SemCor dataset.
  • Figure 4: Contrastive learning on word senses.
  • Figure 5: (a) Performance gap between hard metaphors and literal counterparts. (b) Metaphor Identification (MI) Recall score of MelBERT on hard metaphors with different overlap ratios.
  • ...and 1 more figure