NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers

Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami

TL;DR

NYT-Connections introduces a living word-grouping benchmark, derived from the New York Times Connections game, that is designed to isolate deliberate System 2 reasoning in LLMs. Each puzzle presents 16 terms that must be sorted into 4 related groups. The paper evaluates six LLMs, a simple baseline heuristic (embedding similarity combined with beam search), and humans under three configurations: One Try, No Hints, and Full Hints. Humans outperform even the best models by nearly 30%, chain-of-thought prompting offers limited gains, and the simple heuristic approaches the performance of some LLMs, suggesting that current models operate somewhere between System 1 and System 2. The paper highlights linguistic isolation, resistance to shortcuts, and continual dataset updates as key features, and positions NYT-Connections as a tool for probing deliberate reasoning and guiding future work, including multilingual extensions and broader prompting strategies.

Abstract

Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
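To make the "simple machine learning heuristic" concrete, the following is a minimal sketch of how an embedding-plus-beam-search grouper could work. It is not the authors' implementation: the scoring rule (mean pairwise cosine similarity), the beam width, the function names, and the hand-built toy vectors are all assumptions chosen for illustration. A real run would substitute embeddings from a pretrained model.

```python
import itertools
import math

# Toy puzzle and toy vectors: purely illustrative, not taken from the dataset.
TOY_THEMES = {
    "fish": ["bass", "trout", "salmon", "perch"],
    "colors": ["red", "green", "blue", "pink"],
    "tools": ["hammer", "wrench", "drill", "saw"],
    "planets": ["mars", "venus", "saturn", "neptune"],
}


def toy_embeddings(themes):
    """Build one near-one-hot vector per word (assumed stand-in for real embeddings)."""
    emb = {}
    for i, members in enumerate(themes.values()):
        for j, word in enumerate(members):
            vec = [0.0] * len(themes)
            vec[i] = 1.0                          # dominant theme axis
            vec[(i + 1) % len(themes)] = 0.1 * j  # small leakage toward another theme
            emb[word] = vec
    return emb


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_score(group, emb):
    """Mean pairwise cosine similarity inside a candidate group."""
    pairs = list(itertools.combinations(group, 2))
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


def beam_search_groups(words, emb, group_size=4, beam_width=5):
    """Partition `words` into groups, keeping the `beam_width` best partial partitions."""
    # Each beam entry: (total score, groups chosen so far, words still unassigned).
    beams = [(0.0, [], tuple(words))]
    for _ in range(len(words) // group_size):
        candidates = []
        for score, groups, remaining in beams:
            # Always place the first unassigned word, so each partition is built once.
            anchor, rest = remaining[0], remaining[1:]
            for others in itertools.combinations(rest, group_size - 1):
                group = (anchor,) + others
                left = tuple(w for w in rest if w not in others)
                candidates.append(
                    (score + group_score(group, emb), groups + [group], left)
                )
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [set(g) for g in beams[0][1]]


if __name__ == "__main__":
    emb = toy_embeddings(TOY_THEMES)
    for group in beam_search_groups(sorted(emb), emb):
        print(sorted(group))
```

On the toy vectors the themes are cleanly separated, so the search recovers them easily. The benchmark's puzzles are built so that surface similarity is misleading, which is exactly where a similarity-only heuristic like this one would be expected to struggle.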

Paper Structure

This paper contains 33 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of Connections game instance and its embeddings visualization.
  • Figure 2: Examples of GPT-4's output demonstrating the shallow reasoning of Chain-of-Thought-based approaches. In the first example, the model latches onto words in a laundry category; in the second, it correctly identifies the group but fails to produce effective word groupings.
  • Figure 3: Average performance vs. difficulty level for GPT-4 with various prompting techniques on Full Hints.
  • Figure 4: Demonstration of our three setups. One Try: Players have one chance to classify the words into the four groups. No Hints: Players have 4 chances to get the correct groups, where at each chance they are tasked to find a correct grouping. Full Hints: Same as No Hints, but players are told when they are one word away from a correct grouping.
  • Figure 5: Example output from GPT-4 for Full Hints configuration, showing the "One Away" hint being given to the player.
  • ...and 4 more figures
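The three setups described in the Figure 4 caption can be sketched as a small evaluation harness. This is an illustrative reading of the caption, not the paper's evaluation code: the function names, the feedback strings, and the choice to count every guess against the budget of 4 chances are assumptions.

```python
def feedback(guess, solution, full_hints):
    """Judge one proposed group against the true groups.

    Returns "correct", "one away" (only when full_hints is True), or "incorrect".
    """
    guess = frozenset(guess)
    best_overlap = max(len(guess & frozenset(group)) for group in solution)
    if best_overlap == len(guess):
        return "correct"
    if full_hints and best_overlap == len(guess) - 1:
        return "one away"
    return "incorrect"


def score_one_try(partition, solution):
    """One Try: the player submits all four groups at once; count exact matches."""
    truth = {frozenset(group) for group in solution}
    return sum(frozenset(group) in truth for group in partition)


def play_multi(propose, solution, full_hints, max_attempts=4):
    """No Hints / Full Hints: the player proposes one group per attempt.

    `propose` is a hypothetical callback that receives the history of
    (guess, feedback) pairs and returns the next group to try.
    Assumption: every guess, right or wrong, uses up one attempt.
    """
    history, solved = [], set()
    for _ in range(max_attempts):
        guess = frozenset(propose(history))
        result = feedback(guess, solution, full_hints)
        history.append((guess, result))
        if result == "correct":
            solved.add(guess)
        if len(solved) == len(solution):
            break
    return len(solved), history


if __name__ == "__main__":
    solution = [
        {"a1", "a2", "a3", "a4"},
        {"b1", "b2", "b3", "b4"},
        {"c1", "c2", "c3", "c4"},
        {"d1", "d2", "d3", "d4"},
    ]
    near_miss = {"a1", "a2", "a3", "b1"}
    print(feedback(near_miss, solution, full_hints=True))
    print(feedback(near_miss, solution, full_hints=False))
```

The only difference between No Hints and Full Hints in this sketch is the `full_hints` flag, which mirrors the caption: both setups share the same attempt budget, and Full Hints additionally reveals when a guess is one word away from a correct group.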