Unveiling LLMs' Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence
Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong
TL;DR
This study probes LLMs' metaphor understanding from three angles: concept mapping in embedding space, the existence of a metaphor-literal repository within models, and sensitivity to syntactic structure. Using a spatial analysis with $d_p$ and $\cos\theta$, imagination overlap metrics, and syntactic disruption tests on the Fig-QA and MUNCH datasets, the authors find that 15–25% of interpretations are concept-irrelevant, that context is used only partially, and that models are notably sensitive to syntactic irregularities. GPT-4o shows the strongest reduction in concept-irrelevance, while V3-671B shows stronger alignment in the conceptual plane, but overall the results indicate inconsistent metaphor comprehension across models. The work highlights the need for robust methods that fuse contextual reasoning with syntactic awareness to achieve deeper, concept-level metaphor understanding in LLMs. $d_p$, $\cos\theta$, and $Ad$ emerge as complementary diagnostics for evaluating conceptual alignment in generated interpretations.
Abstract
Metaphor analysis is a complex linguistic phenomenon shaped by context and external factors. While Large Language Models (LLMs) demonstrate advanced capabilities in knowledge integration, contextual reasoning, and creative generation, their mechanisms for metaphor comprehension remain insufficiently explored. This study examines LLMs' metaphor-processing abilities from three perspectives: (1) Concept Mapping: using embedding space projections to evaluate how LLMs map concepts in target domains (e.g., misinterpreting "fall in love" as "drop down from love"); (2) Metaphor-Literal Repository: analyzing metaphorical words and their literal counterparts to identify inherent metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how metaphorical syntactic structures influence LLMs' performance. Our findings reveal that LLMs generate 15%-25% conceptually irrelevant interpretations, depend on metaphorical indicators in training data rather than contextual cues, and are more sensitive to syntactic irregularities than to structural comprehension. These insights underline the limitations of LLMs in metaphor analysis and call for more robust computational approaches.
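The summary above names $d_p$ and $\cos\theta$ as spatial diagnostics but does not define them here. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a "conceptual plane" spanned by a source-concept embedding and a target-concept embedding, takes $d_p$ to be the distance from an interpretation embedding to that plane, and takes $\cos\theta$ to be the cosine between the interpretation embedding and its projection onto the plane. The function name `plane_diagnostics` and its arguments are hypothetical.

```python
import numpy as np

def plane_diagnostics(source, target, interp):
    """Return (d_p, cos_theta) for an interpretation embedding.

    ASSUMPTION: the plane is span(source, target); d_p is the distance
    from `interp` to that plane; cos_theta is the cosine between `interp`
    and its orthogonal projection onto the plane. These definitions are
    illustrative guesses, not taken from the paper.
    """
    s = np.asarray(source, dtype=float)
    t = np.asarray(target, dtype=float)
    v = np.asarray(interp, dtype=float)

    # Build an orthonormal basis (e1, e2) of the plane via Gram-Schmidt.
    e1 = s / np.linalg.norm(s)
    t_perp = t - np.dot(t, e1) * e1
    norm_perp = np.linalg.norm(t_perp)
    if norm_perp < 1e-12:
        raise ValueError("source and target are collinear; no plane defined")
    e2 = t_perp / norm_perp

    # Orthogonal projection of the interpretation onto the plane.
    proj = np.dot(v, e1) * e1 + np.dot(v, e2) * e2

    # Distance to the plane is the norm of the residual.
    d_p = float(np.linalg.norm(v - proj))

    # Cosine of the angle between the vector and its projection.
    denom = np.linalg.norm(v) * np.linalg.norm(proj)
    cos_theta = float(np.dot(v, proj) / denom) if denom > 0 else 0.0
    return d_p, cos_theta
```

Under these assumptions, a small $d_p$ together with a $\cos\theta$ near 1 would indicate an interpretation that stays close to the plane of the two concepts, while a large $d_p$ would flag a candidate concept-irrelevant interpretation.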
