ConMeC: A Dataset for Metonymy Resolution with Common Nouns
Saptarshi Ghosh, Tianyu Jiang
TL;DR
This work introduces ConMeC, a large, human-annotated dataset of 6,000 Wikipedia sentences targeting metonymy in common nouns, addressing a gap in prior datasets focused on named entities. It proposes a two-step, chain-of-thought prompting framework with category-dependent prompts and self-consistency to detect metonymy using large language models, and compares these methods against a fine-tuned BERT baseline. Experiments across ConMeC and three other datasets show that LLMs can achieve competitive performance on certain metonymy categories, but still struggle with nuanced semantic distinctions, while BERT remains strongest overall on ConMeC. The results also reveal insights into cross-category generalization, the impact of contextual information, and the benefits and limits of majority voting in LLM-based metonymy resolution. The dataset and methodology offer a foundation for future improvements in metonymy understanding and downstream NLP tasks that rely on implicit semantic relations.
Abstract
Metonymy plays an important role in our daily communication. People naturally think about things using their most salient properties or commonly related concepts. For example, by saying "The bus decided to skip our stop today," we actually mean that the bus driver made the decision, not the bus. Prior work on metonymy resolution has mainly focused on named entities. However, metonymy involving common nouns (such as desk, baby, and school) is also a frequent and challenging phenomenon. We argue that NLP systems should be capable of identifying the metonymic use of common nouns in context. We create a new metonymy dataset, ConMeC, which consists of 6,000 sentences. Each sentence is paired with a target common noun and annotated by humans to indicate whether that noun is used metonymically in that context. We also introduce a chain-of-thought-based prompting method for detecting metonymy using large language models (LLMs). We evaluate our LLM-based pipeline and a supervised BERT model on our dataset and three other metonymy datasets. Our experimental results demonstrate that LLMs can achieve performance comparable to the supervised BERT model on well-defined metonymy categories, while still struggling with instances requiring nuanced semantic understanding. Our dataset is publicly available at: https://github.com/SaptGhosh/ConMeC.
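The self-consistency step mentioned in the TL;DR, where several chain-of-thought answers are sampled and the majority label is kept, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the label strings "metonymic" and "literal", the function names, and the stub sampler standing in for an LLM call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Hypothetical label names; the dataset's actual label encoding may differ.
METONYMIC = "metonymic"
LITERAL = "literal"


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one sampled label")
    return Counter(labels).most_common(1)[0][0]


def detect_metonymy(
    sentence: str,
    target: str,
    sample_answer: Callable[[str, str], str],
    n_samples: int = 5,
) -> str:
    """Sample n_samples answers for (sentence, target) and majority-vote them.

    sample_answer is a placeholder for one chain-of-thought LLM query that
    returns a final label; it is an assumption, not a real API.
    """
    votes = [sample_answer(sentence, target) for _ in range(n_samples)]
    return majority_vote(votes)


def make_stub_sampler(answers: Iterable[str]) -> Callable[[str, str], str]:
    """Build a deterministic stand-in for the LLM that replays fixed answers."""
    it = iter(answers)

    def sample(sentence: str, target: str) -> str:
        return next(it)

    return sample


if __name__ == "__main__":
    sampler = make_stub_sampler([METONYMIC, LITERAL, METONYMIC])
    label = detect_metonymy(
        "The bus decided to skip our stop today.", "bus", sampler, n_samples=3
    )
    print(label)
```

In this sketch the sampler is injected as a function, so the voting logic can be exercised without any model access; replacing the stub with a real LLM call is left open.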
