NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami
TL;DR
NYT-Connections is a living word-grouping benchmark derived from the New York Times Connections game, designed to isolate deliberate System 2 reasoning in LLMs. Each puzzle presents 16 terms that must be sorted into 4 groups of related words. The paper evaluates six LLMs, a simple embedding-based heuristic that uses beam search as a baseline, and humans under three configurations: One Try, No Hints, and Full Hints. Humans outperform even the top-performing models by roughly 30 percentage points, chain-of-thought prompting offers limited gains, and the simple heuristic approaches the performance of some LLMs, suggesting that current models operate somewhere between System 1 and System 2 capabilities. The paper highlights linguistic isolation, resistance to shortcuts, and continual dataset updates as key features, and positions NYT-Connections as a tool for probing deliberate reasoning and guiding future work, including multilingual extensions and broader prompting strategies.
Abstract
Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
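To make the baseline idea concrete, the following is a minimal sketch of how an embedding-based heuristic with beam search could sort 16 words into 4 groups. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the embedding model, the scoring function, or the beam width, so the toy vectors, the mean pairwise cosine score, and the names `toy_embeddings`, `group_score`, and `beam_search_groups` are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_score(group, emb):
    """Mean pairwise cosine similarity of a candidate group (assumed scoring rule)."""
    pairs = list(combinations(group, 2))
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


def beam_search_groups(words, emb, group_size=4, beam_width=5):
    """Partition unique `words` into groups, keeping the `beam_width` best partial solutions."""
    # Each beam state is (total_score, groups_so_far, remaining_words).
    beam = [(0.0, (), tuple(words))]
    for _ in range(len(words) // group_size):
        candidates = []
        for total, groups, remaining in beam:
            # Anchor on the first remaining word so the same partition
            # is not generated again with its groups in a different order.
            anchor, rest = remaining[0], remaining[1:]
            for combo in combinations(rest, group_size - 1):
                group = (anchor,) + combo
                left = tuple(w for w in remaining if w not in group)
                candidates.append(
                    (total + group_score(group, emb), groups + (group,), left)
                )
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


def toy_embeddings(categories):
    """Hypothetical stand-in for a real embedding model: one dominant axis per category."""
    emb = {}
    n = len(categories)
    for c, members in enumerate(categories.values()):
        for j, word in enumerate(members):
            vec = [0.0] * n
            vec[c] = 1.0                        # shared category direction
            vec[(c + 1) % n] = 0.1 * (j + 1)    # small per-word perturbation
            emb[word] = vec
    return emb


# Made-up puzzle, not taken from the dataset.
categories = {
    "fruit": ["apple", "pear", "plum", "kiwi"],
    "metal": ["iron", "zinc", "gold", "lead"],
    "bird": ["crow", "wren", "hawk", "swan"],
    "tool": ["saw", "file", "drill", "vise"],
}
emb = toy_embeddings(categories)
# Interleave the categories so the input order gives nothing away.
words = [w for row in zip(*categories.values()) for w in row]
groups = beam_search_groups(words, emb)
```

On these toy vectors the search recovers the four intended categories, because same-category words are far closer to each other than to any other word. Real Connections puzzles are built so that surface similarity is misleading (for example, "lead" and "file" each have several senses), which is the kind of shortcut the benchmark is described as resisting.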
