Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

Matteo Fuoli, Weihang Huang, Jeannette Littlemore, Sarah Turner, Ellen Wilding

TL;DR

This paper tackles the scalability bottleneck in metaphor analysis by testing large language models (LLMs) for full-text metaphor identification. It compares retrieval-augmented generation (RAG), prompt engineering, and fine-tuning across multiple models, using an IMDb film-review corpus annotated with a phraseological metaphor scheme, with model output returned as XML. Fine-tuning achieves the highest accuracy (median $F1$ of $0.79$), and chain-of-thought prompting provides strong gains among the prompting-based approaches. Discrepancies with human coders are mostly systematic, reflecting established grey areas in metaphor theory. The results support a semi-automated, human-in-the-loop workflow that can scale metaphor annotation and serve as a testbed for refining metaphor identification protocols and theory.
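To make the headline metric concrete, the following is a minimal sketch of how an F1 score could be computed between a human-coded and a model-coded version of the same text, and how a median could then be taken over several texts. It rests on assumptions that are not taken from the paper: the tag name `<metaphor>`, word-level scoring, the example sentence, and the helper names `metaphor_positions` and `f1_score` are all hypothetical.

```python
import re
from statistics import median

# Hypothetical tag name; the paper's actual XML scheme is not shown here.
TAG_SPLIT = re.compile(r"(</?metaphor>)")
WORD = re.compile(r"\w+")


def metaphor_positions(annotated: str) -> set[int]:
    """Return the indices of words that sit inside <metaphor> tags.

    Words are counted across the whole text with tags ignored, so two
    annotations of the same text share one index space.
    """
    positions: set[int] = set()
    inside = False
    index = 0
    for chunk in TAG_SPLIT.split(annotated):
        if chunk == "<metaphor>":
            inside = True
        elif chunk == "</metaphor>":
            inside = False
        else:
            for _ in WORD.finditer(chunk):
                if inside:
                    positions.add(index)
                index += 1
    return positions


def f1_score(gold: set[int], predicted: set[int]) -> float:
    """Word-level F1; returns 0.0 when there are no true positives."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: the model misses one word of a two-word expression.
human = "The plot is a <metaphor>rollercoaster</metaphor> that <metaphor>falls flat</metaphor>."
model = "The plot is a <metaphor>rollercoaster</metaphor> that falls <metaphor>flat</metaphor>."

score = f1_score(metaphor_positions(human), metaphor_positions(model))
print(round(score, 2))

# A per-method summary would take the median over per-text scores
# (the extra values here are placeholders, not results from the paper).
print(median([score, 0.6, 0.9]))
```

Scoring at the word level is only one option; matching whole tagged spans would penalise partial overlaps more heavily, and the paper may use a different unit.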

Abstract

Metaphor is a pervasive feature of discourse and a powerful lens for examining cognition, emotion, and ideology. Large-scale analysis, however, has been constrained by the need for manual annotation due to the context-sensitive nature of metaphor. This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance. Within prompt engineering, we test zero-shot, few-shot, and chain-of-thought strategies. Our results show that state-of-the-art closed-source LLMs can achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. A comparison of human and LLM outputs reveals that most discrepancies are systematic, reflecting well-known grey areas and conceptual challenges in metaphor theory. We propose that LLMs can be used to at least partly automate metaphor identification and can serve as a testbed for developing and refining metaphor identification protocols and the theory that underpins them.
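The three prompt engineering strategies named above differ mainly in what is placed around the text to be annotated. The sketch below illustrates that difference only; the instruction wording, the example sentence, the `<metaphor>` tag, and the function `build_prompt` are hypothetical and are not the prompts used in the study.

```python
# Hypothetical instruction text; not the wording used in the paper.
BASE_INSTRUCTION = (
    "Annotate the text below. Wrap every metaphorical expression in "
    "<metaphor>...</metaphor> tags and leave the rest of the text unchanged."
)

# Invented hand-coded example pair (raw text, annotated text).
EXAMPLES = [
    (
        "The film drags in the middle.",
        "The film <metaphor>drags</metaphor> in the middle.",
    ),
]

COT_INSTRUCTION = (
    "Before answering, reason step by step: for each candidate expression, "
    "compare its meaning in context with a more basic meaning, then give "
    "the final annotated text."
)

STRATEGIES = {"zero-shot", "few-shot", "chain-of-thought"}


def build_prompt(text: str, strategy: str = "zero-shot") -> str:
    """Assemble a prompt string for one of the three strategies."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    parts = [BASE_INSTRUCTION]
    if strategy == "few-shot":
        # Few-shot: show worked input/output pairs before the target text.
        for raw, annotated in EXAMPLES:
            parts.append(f"Input: {raw}\nOutput: {annotated}")
    elif strategy == "chain-of-thought":
        # Chain-of-thought: ask for explicit reasoning before the answer.
        parts.append(COT_INSTRUCTION)
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)


print(build_prompt("The ending left me cold.", "few-shot"))
```

Zero-shot sends the instruction alone, few-shot adds annotated examples, and chain-of-thought adds a request for intermediate reasoning. The resulting string would then be passed to whichever model API is in use.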

Paper Structure

This paper contains 22 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Median F1 scores with corresponding distributions across models and core methods. The line inside each box represents the median, the box spans the interquartile range (IQR), notches approximate 95% confidence intervals for the median, and whiskers indicate variability beyond the quartiles.
  • Figure 2: Median F1 scores with corresponding distributions across models and core prompt engineering strategies. The line inside each box represents the median, the box spans the interquartile range (IQR), notches approximate 95% confidence intervals for the median, and whiskers indicate variability beyond the quartiles.
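For readers unfamiliar with notched box plots, the sketch below shows one common convention for the notch: median ± 1.57 × IQR / √n, which is what matplotlib uses by default. Whether the figures above were drawn with exactly this rule is an assumption, and the scores in the example are invented placeholders, not results from the paper.

```python
from math import sqrt
from statistics import quantiles


def notch_interval(scores: list[float]) -> tuple[float, float, float]:
    """Return (lower notch, median, upper notch) for a sample of scores.

    Uses the common approximation median +/- 1.57 * IQR / sqrt(n).
    """
    # method="inclusive" interpolates linearly between observed values.
    q1, med, q3 = quantiles(scores, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / sqrt(len(scores))
    return med - half_width, med, med + half_width


# Placeholder per-text F1 scores for one model and method.
example_scores = [0.70, 0.75, 0.79, 0.82, 0.85]
low, mid, high = notch_interval(example_scores)
print(f"median={mid:.2f}, notch=({low:.3f}, {high:.3f})")
```

When the notches of two boxes do not overlap, this is usually read as rough evidence that the two medians differ.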