Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti, Tommaso Caselli, Malvina Nissim, Arianna Bisazza

TL;DR

The paper evaluates how well LLMs solve Italian rebuses, using a newly built dataset of verbalized rebuses (EurekaRebus) aligned with ItaCW definitions, resulting in 83k training examples and a 2k test set. State-of-the-art prompted LLMs perform poorly on this constrained, multi-step reasoning task (best ~24%), while a small Phi-3 Mini (3.8B) fine-tuned with QLoRA reaches about 51% solution accuracy, a gain that stems largely from memorization rather than robust generalization. Granular word-level analysis and out-of-distribution testing expose the limits of current LLMs in sequential instruction-following and constrained puzzle solving, and the paper discusses possible improvements via constraint-aware search and broader language coverage. Overall, the work provides a benchmark for evaluating linguistic proficiency and sequential reasoning in LLMs, and the gaps it highlights motivate future multimodal and cross-linguistic investigations.

Abstract

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.

Paper Structure (22 sections, 2 figures, 11 tables)

Figures (2)

  • Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs' sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l.
  • Figure 2: Word frequencies for words in first passes (top) and solutions (bottom) for the selected subset of EurekaRebus used for training and evaluation. Words are colored according to their length, and the most frequent examples per frequency bin are highlighted.
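
The two-step structure behind a verbalized rebus, as described in the Figure 1 caption, can be illustrated with a minimal sketch. It is not the authors' code. The bracket notation for definitions, the numeric length key, the function names, and the toy puzzle (which resolves to "luna piena") are all assumptions made for illustration. The definition answers are supplied by hand here; producing them is the part a model would have to do.

```python
import re


def first_pass(verbalized: str, answers: dict[str, str]) -> str:
    """Replace each [definition] with its answer word, keeping loose letters.

    The [ ... ] notation is an assumed format for this sketch.
    """
    return re.sub(r"\[([^\]]+)\]", lambda m: answers[m.group(1)], verbalized)


def resegment(first_pass_text: str, key: list[int]) -> str:
    """Re-split the first-pass letters into words of the lengths given by key."""
    letters = first_pass_text.replace(" ", "").lower()
    if sum(key) != len(letters):
        raise ValueError("length key does not match the first pass")
    words, start = [], 0
    for length in key:
        words.append(letters[start:start + length])
        start += length
    return " ".join(words)


# Hypothetical toy puzzle, invented for this example (not from EurekaRebus).
puzzle = "L [Articolo indeterminativo femminile] P [Animale che ride]"
answers = {
    "Articolo indeterminativo femminile": "una",
    "Animale che ride": "iena",
}

fp = first_pass(puzzle, answers)   # "L una P iena"
solution = resegment(fp, [4, 5])   # "luna piena"
```

The length check in `resegment` is the kind of hard constraint that makes the task sequential: one wrong definition answer changes the letter count and invalidates every later word boundary.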