Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti; Tommaso Caselli; Malvina Nissim; Arianna Bisazza

TL;DR

The paper evaluates how well LLMs solve Italian rebuses using EurekaRebus, a newly built dataset of verbalized rebuses that pairs rebus first passes with crossword definitions from ItaCW, yielding 83k training examples and a 2k test set. State-of-the-art prompted LLMs perform poorly on this constrained, multi-step reasoning task (best solution accuracy of about 24%), while a small Phi-3 Mini (3.8B) model fine-tuned with QLoRA reaches about 51%. Word-level analysis and out-of-distribution testing indicate that this gain stems largely from memorization rather than robust generalization, exposing the limits of current LLMs in sequential instruction following and constrained puzzle solving. The authors discuss possible improvements via constraint-aware search and broader language coverage, and present the task as a benchmark for linguistic proficiency and sequential reasoning that motivates future multimodal and cross-linguistic work.

Abstract

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.
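To make the multi-step nature of the task concrete, here is a minimal sketch of how a verbalized rebus might be resolved once every definition has been answered. It is an illustration under stated assumptions, not the authors' pipeline: the `Piece` representation, the function names, the word-length key passed to `resegment`, and the English toy example are all hypothetical and do not come from EurekaRebus.

```python
from typing import List, Tuple, Union

# Hypothetical representation: a piece is either literal letters (str)
# or a (definition, answer) pair standing in for an image.
Piece = Union[str, Tuple[str, str]]


def first_pass(pieces: List[Piece]) -> str:
    """Join literal letters and definition answers into the intermediate reading."""
    return " ".join(p if isinstance(p, str) else p[1] for p in pieces)


def resegment(first_pass_text: str, key: List[int]) -> str:
    """Re-split the first pass into words of the lengths given by `key`.

    The word-length key is an assumption of this sketch.
    """
    letters = "".join(ch for ch in first_pass_text if ch.isalpha()).lower()
    if sum(key) != len(letters):
        raise ValueError("key does not match the number of letters")
    words, start = [], 0
    for length in key:
        words.append(letters[start:start + length])
        start += length
    return " ".join(words)


# Invented toy example (English, for readability): letters "NO" followed by
# a definition whose answer is "table".
toy = ["NO", ("piece of furniture with legs", "table")]
reading = first_pass(toy)
solution = resegment(reading, [3, 4])
print(reading, "->", solution)
```

In this toy case the same letters are read first as "NO table" and then as "not able". A single wrong definition answer changes the letter sequence and breaks the final segmentation, which is one way to see why the task rewards exact, sequential instruction following.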

Paper Structure

This paper contains 22 sections, 2 figures, 11 tables.

Introduction
Background and Related Work
Italian Enigmistica and Rebuses
Linguistic Puzzles as NLP Progress Metrics
LLMs as Sequential Reasoners
Experimental Setup
Data
Models
Format
Metrics
Results
What Motivates Model Performances?
Word Complexity and Frequency Affects LLM Fine-tuning Performance
LLM Fine-Tuning Fails to Generalize to Unseen Words
Manual Inspection
...and 7 more sections

Figures (2)

Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs' sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l.
Figure 2: Word frequencies for words in first passes (top) and solutions (bottom) for the selected subset of EurekaRebus used for training and evaluation. Words are colored according to their length, and the most frequent examples per frequency bin are highlighted.
