AGI Is Coming... Right After AI Learns to Play Wordle
Sarath Shekkizhar, Romain Cosentino
TL;DR
This work probes the limits of multimodal Computer-Using Agents by evaluating OpenAI's CUA on the Wordle game, a simple yet information-rich task. Using a GUI-focused evaluation with screenshots and self-annotation, the study reveals a dramatic context-dependent color recognition failure that correlates with poor puzzle-solving performance. The key finding is that color perception degrades as the game progresses, with edge letters more reliably identified than central ones, likely tied to image tokenization and attention mechanisms. The results argue for foundational improvements in perception, reasoning, and architecture beyond current benchmarks, highlighting important directions for future research toward robust, generalizable agents.
Abstract
This paper investigates multimodal agents, in particular OpenAI's Computer-Using Agent (CUA), which is trained to control a standard computer interface and complete tasks through it, much as a human would. We evaluated the agent's performance on the New York Times Wordle game to elicit model behaviors and identify shortcomings. Our findings revealed a significant discrepancy in the model's ability to recognize colors correctly depending on the context. The model had a $5.36\%$ success rate over several hundred runs across a week of Wordle. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce the fact that even simple tasks present substantial challenges for today's frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.
