Large Language Models Lack Understanding of Character Composition of Words

Andrew Shin, Kunitake Kaneko

TL;DR

The paper investigates whether large language models (LLMs) understand the character-level composition of words, a fine-grained unit that token-level training tends to overlook. It introduces a set of simple character-level tasks (retrieval, insertion, deletion, replacement, reordering, and counting) and compares four publicly available LLMs against human performance, with token-level variants of the same tasks included for contrast. The results show a pronounced gap: the models struggle on character-level tasks despite strong token-level performance, while human performance is near perfect; reordering is the task on which performance holds up comparatively well. The authors take this gap as a sign of fundamental limitations in current training paradigms. They discuss potential directions, including embedding character-level information into word representations and incorporating visual features to emulate how humans perceive characters, and note the implications for multilingual and cross-script language processing.
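To make the six task types concrete, the following is a minimal Python sketch of what a ground-truth answer for each could look like. The function names, the 1-based positions, and the exact task semantics (for example, deleting every occurrence of a character rather than a single one) are assumptions made for illustration; the summary above does not give the paper's prompt wording or task definitions.

```python
# Illustrative reference answers for the six character-level task types.
# All names and exact semantics here are hypothetical, not taken from the paper.

def retrieve_char(word: str, position: int) -> str:
    """Return the character at a 1-based position."""
    return word[position - 1]

def insert_char(word: str, position: int, ch: str) -> str:
    """Insert ch so that it ends up at the given 1-based position."""
    return word[:position - 1] + ch + word[position - 1:]

def delete_char(word: str, ch: str) -> str:
    """Delete every occurrence of ch (assumed semantics)."""
    return word.replace(ch, "")

def replace_char(word: str, old: str, new: str) -> str:
    """Replace every occurrence of old with new (assumed semantics)."""
    return word.replace(old, new)

def reorder_chars(word: str, i: int, j: int) -> str:
    """Swap the characters at 1-based positions i and j (assumed semantics)."""
    chars = list(word)
    chars[i - 1], chars[j - 1] = chars[j - 1], chars[i - 1]
    return "".join(chars)

def count_char(word: str, ch: str) -> int:
    """Count how many times ch occurs in word."""
    return word.count(ch)

if __name__ == "__main__":
    w = "language"
    print(retrieve_char(w, 3))        # n
    print(insert_char(w, 1, "x"))     # xlanguage
    print(delete_char(w, "a"))        # lnguge
    print(replace_char(w, "g", "k"))  # lankuake
    print(reorder_chars(w, 1, 8))     # eanguagl
    print(count_char(w, "a"))         # 2
```

Each operation is a one-line string manipulation, which is the point of the comparison: the tasks are trivial for a program or a person looking at the letters, so failures on them plausibly reflect how a model represents words rather than the difficulty of the task.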

Abstract

Large language models (LLMs) have demonstrated remarkable performance on a wide range of natural language tasks. Yet LLMs' successes have been largely restricted to tasks concerning words, sentences, or documents, and it remains questionable how much they understand the minimal units of text, namely characters. In this paper, we examine contemporary LLMs regarding their ability to understand the character composition of words, and show that most of them fail to reliably carry out even simple tasks that humans can handle perfectly. We analyze their behavior in comparison with token-level performance, and discuss potential directions for future research.

Paper Structure (13 sections, 8 tables)