AGI Is Coming... Right After AI Learns to Play Wordle

Sarath Shekkizhar, Romain Cosentino

TL;DR

This work probes the limits of multimodal Computer-Using Agents by evaluating OpenAI's CUA on the Wordle game, a simple yet information-rich task. Using a GUI-focused evaluation with screenshots and self-annotation, the study reveals a dramatic context-dependent color recognition failure that correlates with poor puzzle-solving performance. The key finding is that color perception degrades as the game progresses, with edge letters more reliably identified than central ones, likely tied to image tokenization and attention mechanisms. The results argue for foundational improvements in perception, reasoning, and architecture beyond current benchmarks, highlighting important directions for future research toward robust, generalizable agents.

Abstract

This paper investigates multimodal agents, in particular OpenAI's Computer-Using Agent (CUA), which is trained to operate a standard computer interface and complete tasks much as a human would. We evaluated the agent on the New York Times Wordle game to elicit model behaviors and identify shortcomings. Our findings reveal a significant, context-dependent discrepancy in the model's ability to recognize colors correctly. The model had a $5.36\%$ success rate over several hundred runs across a week of Wordle. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce that even simple tasks present substantial challenges for today's frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.

Paper Structure

This paper contains 19 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Visualization of a Wordle game played by the CUA – During a game session, the CUA is instructed to self-annotate or summarize the screen (via function calling). At the displayed state of the game, the CUA describes the screen as follows: "The grid shows attempts with first guess (S T A R E) displaying two letters in yellow (S and T), 2nd guess (L E E T S) shows green for E, 3rd guess (S I Z E S) retained yellow for S and T, 4th guess (S T E R N) maintained yellow for S, T, and E." This description shows that the agent identifies the color assigned to each letter neither correctly nor consistently while playing the game.
  • Figure 2: (Left) CUA color observation accuracy by letter position and attempt - Heatmap showing the accuracy of color observations (as reported by the CUA model while playing the game) by letter position (1-5) and attempt number (1-5). Accuracy is highest (94%) at position 1, attempt 1, and degrades significantly in later attempts and at central positions. (Right) Depiction of the potential image tokenizer patch boundary - The potential boundaries of the image tokenization patches are shown in red. Because we do not have access to the model itself and interact with it only through the provided API, the boundary is inferred from information in the OpenAI documentation. We believe that the poor perception accuracy results from image tokenization effects combined with the growing reasoning complexity as the number of attempts increases.
  • Figure 3: (Left) Model-generated color observation accuracy by attempt number. The average accuracy (bold black line) decreases dramatically from $42\%$ in attempt $1$ to $6\%$ in attempt $5$, with individual words showing varying decline patterns. This demonstrates the agent's increasing difficulty in correctly perceiving colors as the game progresses. Note that even the first-attempt accuracy, although higher than in later attempts, is low enough to indicate a fundamental perception problem in the model. (Right) Most common observation errors by color type - Bar chart showing the frequency of different types of color recognition errors. Here, Expected is the color that the model should have observed, while Actual is the color the model reported. Gray→Yellow and Gray→Green were the most common errors, followed by Yellow→Green and Yellow→Gray. This suggests the agent has particular difficulty distinguishing gray tiles from colored tiles. Alternatively, the model might simply be biased toward seeing specific colors even when a tile is gray (that is, placing higher confidence in its own chain of thought and guesses; chowdhury2025truthfulness).
  • Figure 4: (Left) Success rate by word - Bar chart showing success rate by target Wordle word. ARROW ($13.6\%$) and TURBO ($10.0\%$) had the highest success rates, while SHEAR and WHEAT had a $0\%$ success rate. The pattern closely mirrors observation accuracy, supporting the connection between color perception and game performance. (Right) Model-generated color observation accuracy (average over attempts) - Bar chart showing model-generated observation accuracy by word. ARROW shows the highest accuracy ($39.3\%$), followed by NURSE ($28.3\%$) and LAUGH ($23.3\%$), while SHEAR shows $0\%$ accuracy. This suggests word-specific factors do influence color perception ability.
  • Figure 5: Correlation between observation color accuracy and word success rate - Scatter plot showing the correlation between observation accuracy and success rate across different words. The Pearson r of $0.694$ indicates a fairly strong positive correlation between the agent's ability to correctly perceive colors and its ability to solve the Wordle puzzle, although the p-value of $0.056$ falls just short of the conventional $0.05$ significance threshold. The trend suggests that perceptual failures carry over into task performance.
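
The color-observation metrics in Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the standard Wordle coloring rules and compares the colors a model reports with the colors the game would actually show. The function names, the pairing of the guess STARE with the target ARROW, and the "observed" colors are all hypothetical choices for illustration; this is not the authors' evaluation code or data.

```python
from collections import Counter


def wordle_feedback(guess: str, target: str) -> list[str]:
    """Expected tile colors under the standard Wordle rules.

    green  = right letter in the right position
    yellow = letter occurs elsewhere in the target (respecting letter counts)
    gray   = everything else
    """
    guess, target = guess.upper(), target.upper()
    colors = ["gray"] * len(guess)
    remaining = Counter()  # target letters not consumed by a green match
    # First pass: mark greens and count the unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            colors[i] = "green"
        else:
            remaining[t] += 1
    # Second pass: mark yellows while unmatched copies of the letter remain.
    for i, g in enumerate(guess):
        if colors[i] != "green" and remaining[g] > 0:
            colors[i] = "yellow"
            remaining[g] -= 1
    return colors


def observation_accuracy(expected: list[str], observed: list[str]) -> float:
    """Fraction of tiles whose reported color matches the expected color."""
    matches = sum(e == o for e, o in zip(expected, observed))
    return matches / len(expected)


def error_counts(expected: list[str], observed: list[str]) -> Counter:
    """Tally (expected, observed) pairs for mismatched tiles only."""
    return Counter((e, o) for e, o in zip(expected, observed) if e != o)


# Hypothetical example: guess STARE against target ARROW.
expected = wordle_feedback("STARE", "ARROW")
# Hypothetical model report: S and T yellow, everything else gray.
observed = ["yellow", "yellow", "gray", "gray", "gray"]

print(expected)                                  # ['gray', 'gray', 'yellow', 'yellow', 'gray']
print(observation_accuracy(expected, observed))  # 0.2
print(error_counts(expected, observed))
```

Averaging `observation_accuracy` per letter position and attempt would give a grid like the Figure 2 heatmap, and summing `error_counts` over games would give Expected→Actual tallies like those in Figure 3 (Right), under the assumptions stated above.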