Linguistic Blind Spots of Large Language Models
Jiali Cheng, Hadi Amiri
TL;DR
This study examines how well recent large language models perform fine-grained linguistic annotation. It finds persistent blind spots in identifying POS tags, phrases, and clauses, especially as linguistic complexity rises. Using a complexity-balanced test set spanning eight levels and gold annotations, the authors evaluate a range of models (including GPT-3.5, Llama3, Llama2, Mistral, Gemini, and Mixtral) and find that even the strongest models exhibit substantial errors and variability. Word-level tasks are comparatively easier, while phrase- and sentence-level structures such as verb phrases (VP), complex nominals (CN), and T-units pose major challenges, with false positives and missing tags persisting across models. The work also shows that prompting strategies offer limited gains and that greater model capacity brings only modest improvements. The authors suggest linguistically informed data curation, curriculum learning, retrieval augmentation, tool use, and human-in-the-loop approaches as ways to improve LLMs' fine-grained linguistic reasoning and reliability.
Abstract
Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of the input text, and when LLMs underperform on specific linguistic structures, this raises concerns about their reliability for detailed linguistic analysis and about whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.
