Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

Matteo Fuoli; Weihang Huang; Jeannette Littlemore; Sarah Turner; Ellen Wilding

TL;DR

This paper tackles the scalability bottleneck in metaphor analysis by testing large language models (LLMs) for full-text metaphor identification. It compares retrieval-augmented generation (RAG), prompt engineering, and fine-tuning across multiple models, using an IMDb film-review corpus annotated with a phraseological metaphor scheme and output as XML. Fine-tuning achieves the highest accuracy, with a median F1 score of 0.79, and chain-of-thought prompting provides strong gains among the prompting-based approaches. Discrepancies with human coders are mostly systematic, reflecting established grey areas in metaphor theory. The results support a semi-automated, human-in-the-loop workflow that can scale metaphor annotation and serve as a testbed for refining metaphor identification protocols and theory.

Abstract

Metaphor is a pervasive feature of discourse and a powerful lens for examining cognition, emotion, and ideology. Large-scale analysis, however, has been constrained by the need for manual annotation due to the context-sensitive nature of metaphor. This study investigates the potential of large language models (LLMs) to automate metaphor identification in full texts. We compare three methods: (i) retrieval-augmented generation (RAG), where the model is provided with a codebook and instructed to annotate texts based on its rules and examples; (ii) prompt engineering, where we design task-specific verbal instructions; and (iii) fine-tuning, where the model is trained on hand-coded texts to optimize performance. Within prompt engineering, we test zero-shot, few-shot, and chain-of-thought strategies. Our results show that state-of-the-art closed-source LLMs can achieve high accuracy, with fine-tuning yielding a median F1 score of 0.79. A comparison of human and LLM outputs reveals that most discrepancies are systematic, reflecting well-known grey areas and conceptual challenges in metaphor theory. We propose that LLMs can be used to at least partly automate metaphor identification and can serve as a testbed for developing and refining metaphor identification protocols and the theory that underpins them.
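To make the headline metric concrete, the following is a minimal sketch of how a per-text F1 score and its median across texts could be computed. It assumes scoring at the level of token positions tagged as metaphorical, which the abstract does not specify; the function name `f1_score`, the toy data, and the convention for empty texts are all hypothetical and are not taken from the study.

```python
from statistics import median


def f1_score(gold: set[int], predicted: set[int]) -> float:
    """F1 between human-coded and model-coded metaphor positions in one text."""
    if not gold and not predicted:
        # Assumed convention: no metaphors to find and none flagged counts as perfect.
        return 1.0
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of model tags that are correct
    recall = true_pos / len(gold)          # share of human tags the model found
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (human-coded positions, model-coded positions) per text.
texts = [
    ({1, 4, 7}, {1, 4, 9}),
    ({2, 3}, {2, 3}),
    ({5, 6, 8, 10}, {5}),
]

scores = [f1_score(gold, pred) for gold, pred in texts]
median_f1 = median(scores)  # one summary value across texts
print(scores, median_f1)
```

Reporting the median rather than the mean makes the summary less sensitive to a few texts on which the model performs unusually badly or well.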
