MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Sofie Helene Bruun, Dan Saattrup Smart

TL;DR

MultiZebraLogic is a multilingual zebra-puzzle benchmark, built on constraint satisfaction, for evaluating logical reasoning in LLMs across languages and difficulty levels. The system generates puzzles with 14 clue types and 8 red herring types, produces them in nine Germanic languages, and evaluates two LLMs (o3-mini and GPT-4o mini) on sizes $2\times3$ and $4\times5$ using the puzzle-level and cell-level accuracy metrics $A_{\mathrm{puzzle}}$ and $A_{\mathrm{cell}}$. o3-mini outperforms GPT-4o mini on all evaluated sizes. Adding 5 red herrings lowers o3-mini accuracy on $4\times5$ puzzles, whereas switching between English and Danish has no significant effect, and no clue type correlates with difficulty. Datasets with $128$ training and $1024$ test puzzles per size are released for each language, along with the generation code, which is designed for extension to more languages and themes.

Abstract

Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparing logical reasoning skills across several languages, with difficulty suitable for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, and sizes, using 14 different clue types and 8 red herring types (uninformative clues). We find that puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by $15\pm7$ %. Scores of o3-mini on 4x5 puzzles are not significantly affected by the use of English vs. Danish or of the common houses theme vs. the country-specific smørrebrød theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability to more languages and themes.
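The two accuracy metrics can be illustrated with a minimal sketch. This section does not give their formal definitions, so the code below assumes the common reading: $A_{\mathrm{cell}}$ is the fraction of solution cells a model fills in correctly, and $A_{\mathrm{puzzle}}$ is 1 only when every cell is correct. The grid layout, the function names, and the toy puzzle are illustrative assumptions, not taken from the released code.

```python
from typing import List

Grid = List[List[str]]


def cell_accuracy(predicted: Grid, solution: Grid) -> float:
    """Assumed A_cell: fraction of cells where prediction equals solution."""
    # Both grids must have the same shape for a cell-wise comparison.
    if [len(row) for row in predicted] != [len(row) for row in solution]:
        raise ValueError("predicted and solution grids differ in shape")
    pairs = [
        (p, s)
        for pred_row, sol_row in zip(predicted, solution)
        for p, s in zip(pred_row, sol_row)
    ]
    if not pairs:
        raise ValueError("grids must contain at least one cell")
    return sum(p == s for p, s in pairs) / len(pairs)


def puzzle_accuracy(predicted: Grid, solution: Grid) -> float:
    """Assumed A_puzzle: 1.0 if the whole grid is correct, else 0.0."""
    return 1.0 if predicted == solution else 0.0


# Hypothetical toy puzzle: two objects, three attributes each.
solution = [["Alice", "cat", "tea"], ["Bob", "dog", "coffee"]]
# The prediction swaps the two drinks, so 4 of 6 cells are correct.
predicted = [["Alice", "cat", "coffee"], ["Bob", "dog", "tea"]]

a_cell = cell_accuracy(predicted, solution)
a_puzzle = puzzle_accuracy(predicted, solution)
```

Under these assumptions a nearly correct answer still scores 0 at the puzzle level, which is why the cell-level metric is useful as a finer-grained complement.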

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: $\overline{A_{\mathrm{puzzle}}}$ (upper row) and $\overline{A_{\mathrm{cell}}}$ (lower row) for GPT-4o mini (left column) and o3-mini (right column) for 100 puzzles with 5 red herrings in the Danish smørrebrød theme. Sample standard deviations show the spread of $A_{\mathrm{cell}}$ (set to 0 for equal values). For $A_{\mathrm{puzzle}}$, the mean values include all information. Sizes marked in grey are not evaluated. o3-mini performs better than GPT-4o mini for all evaluated sizes.
  • Figure 2: $\Delta\overline{A_{\mathrm{puzzle}}}$ for o3-mini with 0 vs. 5 red herrings for 100 puzzles in the Danish smørrebrød theme. Using 5 red herrings gives a $>2\sigma$ decrease in $\overline{A_{\mathrm{puzzle}}}$ for sizes 3×2, 3×3, 3×5, 4×4, and 4×5.
  • Figure 3: Mean normalised frequencies of all clue types in puzzles with the Danish smørrebrød theme and 5 red herrings. To the right of the red line, all 'clues' are red herrings. Some clue types are only used above certain puzzle sizes; see Table \ref{tab:clue_types}. Frequently selected clues are typically more informative.
  • Figure 4: Difference in mean accuracy between o3-mini and GPT-4o mini for 100 puzzles with 5 red herrings in the Danish smørrebrød theme. The upper plot shows $\Delta\overline{A_{\mathrm{puzzle}}}$ and the lower shows $\Delta\overline{A_{\mathrm{cell}}}$. The uncertainties are the standard deviations of the differences in mean accuracy.
  • Figure 5: $\Delta\overline{A_{\mathrm{puzzle}}}$ for o3-mini with 0 vs. 1 red herrings for 100 puzzles in the Danish smørrebrød theme. Including 1 red herring slightly decreases $\overline{A_{\mathrm{puzzle}}}$, but the effect is not consistent across puzzle sizes.
  • ...and 1 more figure
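The captions of Figures 2 and 4 compare mean accuracies and quote the uncertainty of their difference. A minimal sketch of one way to compute such a quantity follows. It assumes the two sets of puzzles are independent, that $A_{\mathrm{puzzle}}$ is a 0/1 outcome so the standard error of its mean follows the binomial formula, and that the uncertainties add in quadrature. The exact estimator used in the paper is not shown in this section, and the accuracy values below are hypothetical placeholders rather than reported results.

```python
import math
from typing import Tuple


def standard_error(mean_accuracy: float, n_puzzles: int) -> float:
    """Binomial standard error of a mean of 0/1 puzzle-level scores."""
    if not 0.0 <= mean_accuracy <= 1.0 or n_puzzles <= 0:
        raise ValueError("need 0 <= mean_accuracy <= 1 and n_puzzles > 0")
    return math.sqrt(mean_accuracy * (1.0 - mean_accuracy) / n_puzzles)


def accuracy_difference(
    mean_a: float, n_a: int, mean_b: float, n_b: int
) -> Tuple[float, float]:
    """Return (mean_a - mean_b, sigma), assuming independent samples."""
    sigma = math.sqrt(
        standard_error(mean_a, n_a) ** 2 + standard_error(mean_b, n_b) ** 2
    )
    return mean_a - mean_b, sigma


# Hypothetical mean accuracies for 100 puzzles each (not values from the paper).
delta, sigma = accuracy_difference(0.65, 100, 0.80, 100)

# A ">2 sigma decrease" in the sense of the Figure 2 caption would then read:
is_significant_decrease = delta < -2.0 * sigma
```

With 100 puzzles per setting, the uncertainty on a difference of puzzle-level accuracies is several percentage points under these assumptions, so only fairly large changes clear a $2\sigma$ threshold.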