A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models

Francesca De Luca Fornaciari, Begoña Altuna, Itziar Gonzalez-Dios, Maite Melero

TL;DR

The paper addresses idiom processing in NLP by introducing IdioTS, a sentence-level idiom-detection test suite designed to separate figurative from literal interpretations of potentially idiomatic expressions (PIEs). It presents a two-level evaluation framework, combining automatic metrics with a thorough manual evaluation and error analysis, and applies it to three open-source conversational LLMs to reveal grounding gaps and bias phenomena. Key findings show high recall but low specificity in some models, along with substantial grounding inconsistencies among true positives, underscoring the difficulty of robust idiom detection in short, ambiguous sentences. The work contributes the IdioTS resource, a transparent evaluation methodology, and a detailed error taxonomy, with implications for improving figurative-language understanding and evaluation in LLMs.
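
To make the "high recall but low specificity" pattern concrete, the sketch below computes both metrics for a binary idiom-detection task. It is only an illustration: the function name `recall_and_specificity` and the toy labels are assumptions made here, not the paper's code or data.

```python
def recall_and_specificity(gold, predicted):
    """Compute recall and specificity for binary idiom detection.

    gold, predicted: equal-length lists of bools, where True means
    "this sentence contains an idiomatic expression".
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # idiom present, detected
    fn = sum(1 for g, p in pairs if g and not p)      # idiom present, missed
    tn = sum(1 for g, p in pairs if not g and not p)  # literal, correctly rejected
    fp = sum(1 for g, p in pairs if not g and p)      # literal, wrongly flagged

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity


# Toy labels (invented for illustration): a model that flags almost
# every sentence as idiomatic.
gold = [True, True, False, False, False]
predicted = [True, True, True, True, False]

recall, specificity = recall_and_specificity(gold, predicted)
print(f"recall={recall:.2f}, specificity={specificity:.2f}")
```

A model that answers "idiomatic" for nearly every input catches every true idiom, so recall is high, but it also mislabels most literal sentences, so specificity is low. Reporting both metrics exposes that bias.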

Abstract

In this work, we explore idiomatic language processing with Large Language Models (LLMs). We introduce the Idiomatic language Test Suite (IdioTS), a new dataset of difficult examples specifically designed by language experts to assess the capabilities of LLMs to process figurative language at the sentence level. We propose a comprehensive evaluation methodology based on an idiom detection task, in which LLMs are prompted to detect an idiomatic expression in a given English sentence. We present a thorough automatic and manual evaluation of the results and an extensive error analysis.

Paper Structure (20 sections, 1 figure, 3 tables)