Are LLMs Good Cryptic Crossword Solvers?
Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar
TL;DR
This work benchmarks large language models on cryptic crossword solving, a task that requires parsing a clue's definition and wordplay under an enumeration (answer-length) constraint. Using LLaMA2-7b, Mistral-7b with QLoRA, and ChatGPT, the authors explore zero-shot, few-shot, and fine-tuning regimes on The Guardian cryptic clue dataset with multiple data splits. Zero-shot accuracy is near zero, few-shot prompts bring modest gains, and fine-tuning yields limited benefits; the open-source models lag behind a strong ChatGPT baseline, and all models fall well below human solvers. The study also shows that partial letter reveals (word masks) can substantially improve performance, and that cross-dataset generalization remains challenging. The authors release data and code to support reproducibility and future research. Overall, the findings highlight the substantial gap between current LLM capabilities and human cryptic crossword solving, and the authors propose concrete avenues such as chain-of-thought prompting and curriculum-style training to advance progress on this complex language-understanding task.
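To make the two constraints mentioned above concrete, the following is a minimal illustrative sketch of how a candidate answer could be checked against an enumeration and a word mask. The function names, the enumeration format such as "(3,5)", and the mask format with "_" for unrevealed letters are assumptions chosen for illustration; they are not taken from the authors' released code.

```python
import re


def matches_enumeration(answer: str, enumeration: str) -> bool:
    """Return True if the word lengths of `answer` match an enumeration
    such as "(6)" or "(3,5)". The format is an assumption for this sketch."""
    expected = [int(n) for n in re.findall(r"\d+", enumeration)]
    # Split multi-word answers on spaces or hyphens, dropping empty pieces.
    actual = [len(w) for w in re.split(r"[ -]+", answer.strip()) if w]
    return expected == actual


def matches_mask(answer: str, mask: str) -> bool:
    """Return True if `answer` agrees with a partial-letter mask such as
    "s_l__t", where "_" marks an unrevealed letter (hypothetical format)."""
    letters = [c for c in answer.lower() if c.isalpha()]
    slots = [c for c in mask.lower() if c == "_" or c.isalpha()]
    if len(letters) != len(slots):
        return False
    return all(s == "_" or s == l for s, l in zip(slots, letters))
```

A check of this kind could be used to filter model outputs before scoring: for example, `matches_mask("silent", "s_l__t")` accepts the candidate, while the anagram "listen" is rejected because its first letter conflicts with the revealed "s".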
Abstract
Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish the benchmark results for three popular LLMs -- LLaMA2, Mistral, and ChatGPT -- showing that their performance on this task is still far from that of humans.
