Are LLMs Good Cryptic Crossword Solvers?

Abdelrahman Sadallah; Daria Kotova; Ekaterina Kochmar

TL;DR

This work benchmarks large language models on cryptic crossword solving, a task that requires parsing a clue's definition and wordplay under enumeration constraints. Using LLaMA2-7b, Mistral-7b with QLoRA, and ChatGPT, the authors explore zero-shot, few-shot, and fine-tuning regimes on The Guardian cryptic clue dataset with multiple data splits. Zero-shot accuracy is near zero, few-shot prompts bring modest gains, and fine-tuning helps only to a limited extent; the open-source models lag behind a strong ChatGPT baseline, and all models remain well below human solvers. Partial letter reveals (word masks) substantially improve performance, while cross-dataset generalization remains challenging. The authors release data and code to support reproducibility and future research, and propose chain-of-thought prompting and curriculum-style training as concrete avenues for narrowing the gap between current LLM capabilities and human cryptic crossword solving.

Abstract

Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish benchmark results for three popular LLMs (LLaMA2, Mistral, and ChatGPT), showing that their performance on this task is still far from that of humans.

Paper Structure (25 sections, 5 figures, 8 tables)

Introduction
Related Work
LLMs' emergent capabilities
Solving puzzles with NLP models
Methodology
Prompt variation
Few-shot learning
LLM fine-tuning
Data
Data splits
Experiments
Can LLMs solve the clues given various prompts?
Can the models provide partially correct answers?
Few-shot learning: Do models learn from examples?
Do similar examples help?
...and 10 more sections

Figures (5)

Figure 1: An example of a cryptic clue. The number 5 at the end of the clue denotes the number of characters in the answer and is called the enumeration. The definition part here is likely to be "language model", with the rest being the wordplay part. "Beheads" or similar words point to the first letters of the next word, while "confused" (as well as "mixed up", etc.) is likely to indicate an anagram. Since we should look for a language model's name that starts with the letter "l", continues with an anagram of "Alma", and consists of 5 letters, the answer here is "Llama".
Figure 2: The two prompts that we used across all our experiments. The base prompt contains a simple instruction, while the extended prompt also includes explicit information about the answer format.
Figure 3: 3-shot learning input using random examples.
Figure 4: The full original output of the model is incorrect; however, it contains the correct answer. Using enumeration and cleaning, this answer can be extracted from the output.
Figure 5: Example of a prompt helping the model get the correct number of characters in the answer: the number of * symbols in the line after the clue corresponds to the number of letters in the expected answer. For further experiments, we replace some of those symbols with the correct letters of the answer in their respective positions.
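
The mechanics described in the captions for Figures 1, 4, and 5 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (build_mask, matches_mask, extract_answer, is_letter_plus_anagram) are hypothetical, answers are assumed to be single words, and the extraction step is a simple stand-in for the paper's cleaning procedure, which is not specified here.

```python
import random
import re


def build_mask(answer, n_reveal=0, seed=0):
    """Build a Figure 5 style mask: '*' per hidden letter, real letters where revealed.

    n_reveal and seed are illustrative parameters, not taken from the paper.
    """
    rng = random.Random(seed)
    letters = list(answer.lower())
    revealed = set(rng.sample(range(len(letters)), n_reveal))
    return "".join(ch if i in revealed else "*" for i, ch in enumerate(letters))


def matches_mask(candidate, mask):
    """True if candidate has the mask's length and agrees with every revealed letter."""
    return len(candidate) == len(mask) and all(
        m == "*" or m == c for c, m in zip(candidate.lower(), mask)
    )


def extract_answer(output, mask):
    """Figure 4 style clean-up: return the first word in the output that fits the mask.

    Heuristic assumption: the first fitting word is the intended answer.
    """
    for word in re.findall(r"[A-Za-z]+", output):
        if matches_mask(word, mask):
            return word.lower()
    return None


def is_letter_plus_anagram(answer, first_letter, fodder):
    """Check Figure 1 style wordplay: a leading letter followed by an anagram of fodder."""
    answer = answer.lower()
    return (
        answer[:1] == first_letter.lower()
        and sorted(answer[1:]) == sorted(fodder.lower())
    )


if __name__ == "__main__":
    mask = build_mask("llama")  # enumeration only, no letters revealed
    print(mask)
    print(extract_answer("Answer: Llama, since L precedes an anagram of Alma", mask))
    print(is_letter_plus_anagram("Llama", "l", "Alma"))
```

Taking the first word that fits the mask is deliberately naive: any other word of the right length earlier in the output would be returned instead, so revealed letters make the extraction noticeably more reliable than enumeration alone.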
