Missed Connections: Lateral Thinking Puzzles for Large Language Models

Graham Todd, Tim Merino, Sam Earle, Julian Togelius

TL;DR

The paper investigates whether large language models and sentence embeddings can solve the NYT Connections puzzle, using a 250-puzzle dataset as a benchmark. It compares a sentence-embedding baseline (MPNet) with GPT-3.5 and GPT-4-turbo, and examines the effect of chain-of-thought (CoT) prompting as well as a more difficult all-in-one variant of the puzzle. GPT-4-turbo achieves the highest accuracy (up to 29.2% without CoT and 38.93% with CoT), but overall performance remains far from perfect, and the results expose both successes and failure modes in semantic reasoning. The study argues that Connections is a valuable, scalable test-bed for probing semantic representations and abstract associations in data-driven linguistic systems, and it outlines directions for improvement.

Abstract

The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e., definitions and typical usage) and, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
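
To make the task structure described in the abstract concrete, here is a minimal Python sketch of a puzzle and a guess checker. The word bank, the category names, and the helper functions `check_guess` and `count_partitions` are hypothetical and invented for illustration; they are not taken from the paper or from any real puzzle. The sketch also computes how large the search space is, which gives a sense of why blind guessing is impractical.

```python
from math import comb, factorial

# Hypothetical toy puzzle: words and category names are invented for illustration.
SOLUTION = {
    "FISH": {"BASS", "SOLE", "PIKE", "CARP"},
    "INSTRUMENTS": {"DRUM", "HARP", "HORN", "OBOE"},
    "COLORS": {"TEAL", "PLUM", "ROSE", "GOLD"},
    "___ BALL": {"FOOT", "BASE", "SNOW", "MEAT"},
}


def check_guess(guess, solution=SOLUTION):
    """Return the category name if the four guessed words form a group, else None."""
    guess = set(guess)
    if len(guess) != 4:
        raise ValueError("a guess must contain four distinct words")
    for name, words in solution.items():
        if guess == words:
            return name
    return None


def count_partitions(n_words=16, group_size=4):
    """Number of ways to split n_words into unordered groups of group_size."""
    n_groups = n_words // group_size
    return factorial(n_words) // (factorial(group_size) ** n_groups * factorial(n_groups))


if __name__ == "__main__":
    print(check_guess(["BASS", "SOLE", "PIKE", "CARP"]))  # a correct group
    print(check_guess(["BASS", "DRUM", "HARP", "HORN"]))  # a plausible but wrong group
    print(comb(16, 4))         # candidate four-word groups on the first guess
    print(count_partitions())  # complete groupings of the sixteen-word bank
```

The second guess illustrates the kind of overlap that makes the puzzle hard: a word such as BASS plausibly fits more than one theme, so a group can look coherent and still be wrong.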

Paper Structure (17 sections, 3 figures, 2 tables)

This paper contains 17 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: An example Connections puzzle taken from the New York Times web interface on November 28th, 2023
  • Figure 2: Average success rate across all puzzles and seeds for baseline models and LLMs, broken down by puzzle category (note that CoT indicates the use of chain-of-thought prompting). Categories increase in difficulty going from yellow to green to blue to purple. First, we see that category difficulty generally aligns with success rate across models. Between models, we see that LLMs generally outperform the baseline, at best solving a sizeable proportion of puzzles (but not a majority). Finally, we see that chain-of-thought prompting provides a notable boost to LLM performance.
  • Figure 3: Proportion of all categories and overall puzzles solved by the MPNet sentence-embedding baseline as the number of allowed guesses increases. Notably, half of all puzzles are solved within 29 incorrect guesses and all puzzles are solved within 417 guesses. On the challenge variant, only 108 puzzles are solved within 500 incorrect guesses.
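
The guess-budget evaluation summarized in the Figure 3 caption can be sketched as follows. This is an illustrative assumption, not the paper's procedure: this excerpt does not describe how the baseline chooses its guesses, so the sketch simply ranks every candidate four-word group by mean pairwise cosine similarity and guesses in that order, counting incorrect guesses. The 2-D vectors from `toy_embeddings` are hypothetical stand-ins for real sentence embeddings, and all function names are invented here.

```python
from itertools import combinations
from math import cos, radians, sin, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def group_score(group, emb):
    """Mean pairwise cosine similarity of the words in a candidate group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


def solve(emb, solution, max_incorrect=500):
    """Guess the best-scoring remaining group until solved or out of budget.

    Returns (solved, incorrect_guesses).
    """
    remaining = set(emb)
    truth = {frozenset(group) for group in solution}
    incorrect = 0
    while remaining:
        ranked = sorted(
            combinations(sorted(remaining), 4),
            key=lambda g: group_score(g, emb),
            reverse=True,
        )
        for guess in ranked:
            if frozenset(guess) in truth:
                remaining -= set(guess)  # correct: remove the group, then re-rank
                break
            if incorrect == max_incorrect:
                return False, incorrect  # budget of incorrect guesses exhausted
            incorrect += 1
    return True, incorrect


def toy_embeddings():
    """Hypothetical 2-D vectors: four tight clusters of four words each."""
    emb, solution = {}, []
    for k in range(4):
        group = []
        for j, offset in enumerate((-6, -2, 2, 6)):
            word = f"w{k}{j}"
            angle = radians(90 * k + offset)
            emb[word] = (cos(angle), sin(angle))
            group.append(word)
        solution.append(group)
    return emb, solution


if __name__ == "__main__":
    emb, solution = toy_embeddings()
    print(solve(emb, solution))
```

With well-separated clusters, the top-ranked group is always a true category, so no guesses are wasted. Real puzzles are designed with overlapping themes, which is where a similarity-ranked strategy of this kind would be expected to spend many incorrect guesses.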