What Makes Cryptic Crosswords Challenging for LLMs?

Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar

TL;DR

This work investigates why modern LLMs struggle with cryptic crosswords through a targeted, interpretability-focused evaluation. It benchmarks Gemma2, LLaMA3, and ChatGPT on cryptic clues using zero-shot prompts and dissects their performance with three auxiliary tasks: definition extraction, wordplay-type detection, and explanation extraction. The authors introduce a small annotated wordplay dataset and release their code and data. The results show that, despite some gains from prompting strategies and task decomposition, LLMs remain far below human performance, especially in wordplay understanding. The authors discuss future directions such as chain-of-thought and curriculum learning to bridge the gap, and they acknowledge limitations related to the dataset and to data contamination.

Abstract

Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and the datasets we introduce at https://github.com/bodasadallah/decrypting-crosswords.

Paper Structure

This paper contains 28 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: An example of a cryptic clue. The number 5 at the end of the clue, called the enumeration, denotes the number of characters in the answer. The definition part here is "language model", and the rest is the wordplay part. "Beheads" and similar words point to the first letters of the next word, while "confused" (as well as "mixed up", etc.) is likely to indicate an anagram. Since we should look for a language model's name that starts with the letter "l", continues with an anagram of "Alma", and consists of 5 letters, the answer here is LLaMA.
  • Figure C1: Confusion matrix for LLaMA3 on wordplay type prediction using the most informative prompt \ref{fig:wordplay_types_examples_answer}.
  • Figure C2: Confusion matrix for Gemma on wordplay type prediction using the most informative prompt \ref{fig:wordplay_types_examples_answer}.
  • Figure C3: Confusion matrix for ChatGPT on wordplay type prediction using the most informative prompt \ref{fig:wordplay_types_examples_answer}.
  • Figure E1: Base prompt.
  • ...and 6 more figures
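
The wordplay reading given in the Figure 1 caption (a first letter, followed by an anagram, constrained by the enumeration) can be checked mechanically. The following is only a minimal illustrative sketch, not the authors' code: the function name `matches_wordplay`, its parameters, and the candidate list are all hypothetical, and it covers only this one combination of wordplay types.

```python
from collections import Counter


def matches_wordplay(candidate: str, first_letter: str, fodder: str, enumeration: int) -> bool:
    """Check a candidate against the Figure 1 reading (hypothetical helper).

    The candidate must have `enumeration` letters, start with `first_letter`,
    and its remaining letters must be an anagram of `fodder`.
    """
    word = candidate.lower()
    prefix = first_letter.lower()
    if len(word) != enumeration:
        return False
    if not word.startswith(prefix):
        return False
    # Two strings are anagrams when their letter counts are identical.
    return Counter(word[len(prefix):]) == Counter(fodder.lower())


# Assumed candidate answers, chosen only for illustration.
candidates = ["LLaMA", "Gemma", "lamas"]
for name in candidates:
    print(name, matches_wordplay(name, first_letter="l", fodder="Alma", enumeration=5))
```

Only "LLaMA" satisfies all three constraints here. Such a check verifies a proposed answer against an already parsed clue; it does not find the definition or identify the wordplay type, which are the steps the paper's auxiliary tasks probe.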