Linguistic Blind Spots of Large Language Models

Jiali Cheng, Hadi Amiri

TL;DR

This study interrogates the ability of recent large language models to perform fine-grained linguistic annotation, revealing persistent blind spots in identifying POS tags, phrases, and clauses, especially as linguistic complexity rises. By constructing a complexity-balanced test set across eight levels and using gold annotations, the authors critically evaluate a range of models (including GPT-3.5, Llama3/2, Mistral, Gemini, and Mixtral), finding that even the strongest models exhibit substantial errors and variability. Word-level tasks are comparatively easier, while phrase- and sentence-level structures such as VP, CN, and T-units pose major challenges, with false positives and missing tags persisting across models. The work also shows that prompting strategies offer limited gains, and that model capacity provides only modest improvements, suggesting a need for linguistically informed data curation, curriculum learning, retrieval augmentation, tool use, and human-in-the-loop approaches to advance LLMs' fine-grained linguistic reasoning and reliability.

Abstract

Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

Paper Structure

This paper contains 41 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Distribution of linguistic complexity in two widely-used NLP datasets. The plots show (a): a strong skew toward linguistically simple examples in the Penn Treebank and (b): a concentration around moderate complexity in CoNLL 2000, which highlights an overrepresentation of easier or medium-difficulty samples in the datasets.
  • Figure 2: Workflow for finding linguistic blind spots of LLMs. As illustrated in Appendix \ref{sec:gpt_knowledge}, GPT and other LLMs have good knowledge of our target tasks and the relevant terminology used in the prompts. [Linguistic Structure] in the prompts indicates any of the lexical or syntactic structures listed in Appendix \ref{sec:app}.
  • Figure 3: Performance of GPT-3.5 on texts of increasing linguistic complexity. GPT-3.5 achieves near-zero performance on CONJP, T, and CT. Figures \ref{fig:qp_gemini}-\ref{fig:qp_mistral-7b} in Appendix \ref{sec:more_result} show results for other LLMs.
  • Figure 4: Confusion matrix for POS tagging with GPT-3.5. Darker cells indicate larger values. Diagonal and off-diagonal elements represent correct and incorrect predictions, respectively.
  • Figure 5: Distribution of false positive predictions by GPT-3.5 for linguistic structures that are absent from the input. All evaluated LLMs show very similar distributions.
  • ...and 10 more figures
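
To make the Figure 4 caption concrete, here is a minimal sketch of how a POS-tagging confusion matrix can be tallied from gold and predicted tags. It is not the authors' evaluation code: the function names, the tag sequences, and the tiny label set are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def confusion_matrix(gold, pred, labels):
    """Count (gold, predicted) tag pairs as matrix[gold_tag][pred_tag].

    Pairs whose gold or predicted tag is not in `labels` are left out.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}


def accuracy(matrix):
    """Share of counted tokens on the diagonal (gold tag == predicted tag)."""
    correct = sum(matrix[tag][tag] for tag in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0


# Hypothetical toy data, not taken from the paper.
gold = ["DT", "NN", "VBZ", "JJ", "NN"]
pred = ["DT", "NN", "NN", "JJ", "VBZ"]
labels = ["DT", "JJ", "NN", "VBZ"]

matrix = confusion_matrix(gold, pred, labels)
for tag in labels:
    # One row per gold tag; columns follow the order of `labels`.
    print(tag, [matrix[tag][p] for p in labels])
print("accuracy:", accuracy(matrix))
```

Each row corresponds to a gold tag and each column to a predicted tag, so the diagonal holds correct predictions and every off-diagonal cell counts one specific kind of confusion, for example a gold verb tagged as a noun.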