Can large language models understand uncommon meanings of common words?

Jinyang Wu; Feihu Che; Xinxin Zheng; Shuai Zhang; Ruihan Jin; Shuai Nie; Pengpeng Shao; Jianhua Tao

TL;DR

This paper introduces LeSC, a lexical-semantic benchmark with a cross-lingual dimension, designed to probe how well LLMs understand uncommon meanings of common words. Using a diverse suite of models and an evaluation framework that combines absolute and weighted accuracy, prompting strategies, retrieval-augmented generation (RAG), and attention-based visualization, the study finds persistent gaps in lexical-semantic understanding across both open- and closed-source models, including GPT-4 and GPT-3.5, relative to 16-year-old humans. Role-oriented prompts and RAG yield improvements in some cases, but the gains are limited and often diminish with model scale, underscoring fundamental limitations in current LLMs' fine-grained semantic reasoning. The findings suggest a need for deeper investigation into stochastic-parrot limitations, improved transferability, and robust mitigation strategies to advance genuinely human-like lexical understanding in LLMs.

Abstract

Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, because widely accepted testing mechanisms are lacking, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains open, prompting numerous studies and heated debate. Prevailing research mainly focuses on surface-level NLU and neglects fine-grained exploration. However, such exploration is crucial for understanding LLMs' unique comprehension mechanisms, aligning them with human cognition, and ultimately enhancing their general NLU capacities. To address this gap, our study examines LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication in psychology, which emphasize an accurate, shared understanding of word semantics. Specifically, this paper presents the Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating open-source and closed-source models of varied scales and architectures, our extensive experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to alleviate this problem, yet limitations persist. By highlighting these critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.

Paper Structure (34 sections, 7 equations, 7 figures, 4 tables)

Introduction
Materials and methods
The LeSC datasets
Dataset creation
Evaluation metrics
Absolute Accuracy
Weighted Accuracy
Models and methods
Selected models
Human evaluation and random baseline
Human evaluation
Random baseline
Prompting methods
Retrieval-augmented generation
Visualization Technique
...and 19 more sections

Figures (7)

Figure 1: An example from the LeSC dataset. The gray box contains the inputs: a prompt, a question, and the provided options, where 'A', 'B', and 'C' refer to 'low in price', 'unwilling to spend money', and 'of poor quality; inferior', respectively. In the green box, ChatGPT's answer is 'A', which is inconsistent with the correct answer 'C'.
Figure 2: The workflow of LeSC. In stage 1, we first construct the LeSC dataset using GAOKAO and CET sources. After that, in stage 2, we employ advanced strategies to obtain benchmarking results for LLMs.
Figure 3: Overall performance on the LeSC dataset under different settings. The panel titles 'Average', 'Role-oriented Prompts', and 'Task-oriented Prompts' refer to the accuracy ($\times$ 100) of LLMs on all, role-oriented, and task-oriented prompts, respectively. The performance levels of humans (92$\%$) and random selection (23$\%$) are also plotted for reference.
Figure 4: Accuracy ($\times$ 100) across model scales, architectures, and pretraining corpora, broken down by the language (CN, EN) of the options.
Figure 5: Results on LeSC as a function of k (the number of shots) in few-shot prompting. '$Acc_{a}$', '$Acc_{wtd}$', and 'Std' refer to absolute accuracy, weighted accuracy, and standard deviation, respectively.
...and 2 more figures
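
The multiple-choice setup described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that absolute accuracy is simply the fraction of items answered correctly. The paper's exact definitions of absolute and weighted accuracy are not given on this page, so weighted accuracy is not attempted. The `Item` class, the `absolute_accuracy` function, and the placeholder question text are hypothetical names introduced for illustration. Only the three option glosses and the A-versus-C outcome come from the caption.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice item (hypothetical structure, not the paper's schema)."""
    question: str
    options: dict[str, str]  # option label -> candidate meaning
    answer: str              # gold option label


def absolute_accuracy(items: list[Item], predictions: list[str]) -> float:
    """Fraction of items whose predicted label matches the gold label.

    Assumption: this is a plain proportion-correct score; the paper may
    define absolute accuracy differently (for example, aggregated over prompts).
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1
        for item, pred in zip(items, predictions)
        if pred.strip().upper() == item.answer
    )
    return correct / len(items)


# Option glosses and gold label taken from the Figure 1 caption; the
# question text is not shown on this page, so a placeholder is used.
example = Item(
    question="<question text not shown on this page>",
    options={
        "A": "low in price",
        "B": "unwilling to spend money",
        "C": "of poor quality; inferior",
    },
    answer="C",
)

# The model answer reported in the caption is 'A', which scores as incorrect.
print(absolute_accuracy([example], ["A"]))
```

Scoring by option label rather than by free text keeps the comparison unambiguous, which is presumably why the benchmark presents lettered options, though that motivation is an inference rather than something stated here.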
