Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses
Gabriele Sarti, Tommaso Caselli, Malvina Nissim, Arianna Bisazza
TL;DR
The paper evaluates how well LLMs solve Italian rebuses using EurekaRebus, a newly built dataset of verbalized rebuses aligned with ItaCW definitions, with 83k training examples and a 2k test set. State-of-the-art prompted LLMs perform poorly on this constrained, multi-step reasoning task (best solution accuracy ~24%), while a small Phi-3 Mini (3.8B) fine-tuned with QLoRA reaches about 51%, a gain attributed largely to memorization rather than robust generalization. Granular word-level analysis and out-of-distribution testing expose the limits of current LLMs in sequential instruction-following and constrained puzzle solving, and the authors discuss potential improvements via constraint-aware search and broader language coverage. Overall, the work provides a benchmark for evaluating linguistic proficiency and sequential reasoning in LLMs and highlights gaps that motivate future multimodal and cross-linguistic investigations.
Abstract
Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.
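To make the "constrained multi-step reasoning" concrete, here is a minimal sketch of the two steps a solver of a verbalized rebus has to perform: resolve each definition into a word, then re-split the resulting letter sequence into the hidden phrase. The bracketed-definition format, the word-length key, the function names, and the toy puzzle are all assumptions made for illustration; the puzzle is invented and not taken from the dataset.

```python
import re


def first_pass(verbalized: str, answers: list[str]) -> str:
    """Replace each [definition] with its answer, leaving plain letters in place.

    Assumes definitions are wrapped in square brackets (hypothetical format).
    """
    remaining = iter(answers)
    return re.sub(r"\[[^\]]*\]", lambda _match: next(remaining), verbalized)


def resegment(first_pass_text: str, key: list[int]) -> str:
    """Join all letters, then cut them into words whose lengths follow the key."""
    letters = "".join(ch for ch in first_pass_text if ch.isalpha()).upper()
    if sum(key) != len(letters):
        raise ValueError("key does not match the number of letters")
    words, pos = [], 0
    for length in key:
        words.append(letters[pos:pos + length])
        pos += length
    return " ".join(words)


# Invented toy puzzle: "Sovrano" (sovereign) -> "re",
# "Lo rosicchia il cane" (the dog gnaws it) -> "osso".
puzzle = "MA [Sovrano] M [Lo rosicchia il cane]"
step_one = first_pass(puzzle, ["re", "osso"])   # "MA re M osso"
solution = resegment(step_one, [4, 5])          # "MARE MOSSO"
print(step_one)
print(solution)
```

In this sketch the definition answers are supplied by hand. In the task described above, the model has to produce them itself and keep the final phrase consistent with the length constraint, so an error in any single word propagates to the solution.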
