LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems

Nan Xu, Xuezhe Ma

TL;DR

Across simple word-based counting tasks, the paper challenges prevailing explanations for LLM failures and shows that subword tokenization, lack of character-level training, and embedding-size limits do not solely account for errors. By testing tokenization variants, character-level inputs, and cross-domain transfer from math/code-specialized LLMs, it reveals that reasoning strategies are the most robust route to accurate counting. The study also demonstrates that reasoning-based prompting (CoT, self-consistency, ToT) can enable near-perfect performance, especially in GPT-4o, while finetuning alone offers limited or even negative transfer. The findings advocate for training and evaluation focused on reasoning-before-responding to improve reliability on even simple linguistic tasks, and call for broader capability benchmarking beyond traditional tasks.
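Of the reasoning strategies named above, self-consistency is the simplest to illustrate: sample several reasoning paths and keep the most frequent final answer. The following is a minimal sketch of only the voting step, not the paper's implementation; the function name `majority_vote` and the sampled answers are hypothetical, and the model calls that would produce the answers are omitted.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning paths.

    `answers` is a list of final-answer strings, one per sampled path.
    Hypothetical helper for illustration; not taken from the paper.
    """
    # Normalize whitespace and drop empty answers before voting.
    cleaned = [a.strip() for a in answers if a.strip()]
    if not cleaned:
        raise ValueError("no non-empty answers to vote on")
    # most_common(1) returns a list holding the (answer, count) pair
    # with the highest count.
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical final answers from five sampled reasoning paths.
sampled = ["3", "2", "3", " 3 ", "2"]
print(majority_vote(sampled))
```

In practice the list would be filled with final answers parsed from several sampled model responses to the same counting question.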

Abstract

Interestingly, LLMs still struggle with some basic tasks that humans find trivial, e.g., counting the number of r's in the word "strawberry". There are several popular conjectures (e.g., tokenization, architecture, and training data) about why LLMs are deficient at simple word-based counting problems, all sharing the belief that such failures stem from model pretraining and are hence probably inevitable during deployment. In this paper, we carefully design multiple evaluation settings to investigate the validity of these prevalent conjectures. Meanwhile, we measure the transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Although specialized LLMs suffer from counting problems as well, we find the conjectures about an inherent deficiency of LLMs invalid and further seek opportunities to elicit knowledge and capabilities from LLMs that are beneficial to counting tasks. Compared with strategies such as finetuning and in-context learning, which are commonly adopted to enhance performance on new or challenging tasks, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks and respond more accurately. We hope our conjecture validation design can provide insights into the study of future critical failure modes of LLMs. Based on the challenges in transferring advanced capabilities to much simpler tasks, we call for more attention to model capability acquisition and evaluation. We also highlight the importance of cultivating a consciousness of "reasoning before responding" during model pretraining.
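For concreteness, the ground truth for the "strawberry" example and one possible character-level rendering of a word can be sketched as follows. This is an illustrative sketch only: the helper names `count_char` and `to_char_input` and the space-separated format are assumptions, not the paper's evaluation code.

```python
def count_char(word: str, char: str) -> int:
    """Ground-truth number of occurrences of one character in a word.

    Matching is case-sensitive. Hypothetical helper for illustration.
    """
    if len(char) != 1:
        raise ValueError("char must be a single character")
    return word.count(char)


def to_char_input(word: str, sep: str = " ") -> str:
    """Spell a word out character by character.

    Separating characters with spaces is one assumed way to present
    character-level input; the paper's exact format may differ.
    """
    return sep.join(word)


print(count_char("strawberry", "r"))  # 3
print(to_char_input("strawberry"))    # s t r a w b e r r y
```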

Paper Structure

This paper contains 38 sections, 1 equation, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Performance comparison on Task I with different tokenization strategies to verify Conjecture I, that the deficiency in word-based counting is caused by the subword tokenization of LLMs. Top: No perturbation helps all studied models identify core character-level information (above $0$). Bottom: Directly feeding character tokens instead of subword tokens does not help LLMs perceive individual characters within words. This comparison, together with the results on the other three studied tasks shown in \ref{fig:tokenization_extra}, refutes the conjecture regarding LLM tokenization.
  • Figure 2: Performance comparison of LLMs between natural word input and character input on the classification dataset IMDB. Although questions represented by individual characters are rarely seen, all studied models achieve accuracy much higher than random guessing, denoted by the dashed line (- - -). The minor performance drop compared with natural word input implies that LLMs can handle tasks requiring character-level understanding, rejecting the conjecture about a lack of character-level training. Figure \ref{fig:classification_more} shows similar observations on other datasets.
  • Figure 3: Performance variation of GPT-4o on four tasks. $1$st & $2$nd columns: there is no noteworthy correlation between model performance and the number of unique characters in the queried words. $3$rd & $4$th columns: we observe a clear trend of performance drop from both models as word length increases beyond 10. We leave the similar trend for Llama 3 to \ref{fig:unique_len_more}.
  • Figure 4: Performance comparison among LLMs trained on general and special-domain data for Task II. Models specialized in mathematical reasoning or theorem proving handle word-based counting tasks no better than general ones. Although code models are able to write Python code and solve the tasks successfully, they fail to reason and answer accurately in the open-ended setting. We leave results with similar observations on the other three tasks to \ref{fig:math_code_more}.
  • Figure 5: Benefits of applying different reasoning strategies to LLMs for Task I Char Occur. We observe noticeable improvement from all studied reasoning strategies over baselines that specifically request numeric responses or open-ended answers. With the additional reasoning procedure, GPT-4o can even solve the tasks perfectly. We show similar improvement on the other three tasks in \ref{fig:reasoning_others}.
  • ...and 8 more figures