MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Sofie Helene Bruun; Dan Saattrup Smart

TL;DR

MultiZebraLogic introduces a multilingual zebra-puzzle benchmark, based on constraint satisfaction, for evaluating logical reasoning in LLMs across languages and difficulty levels. The system generates puzzles with 14 clue types and 8 red herring types, translates them into nine Germanic languages, and evaluates two LLMs (o3-mini and GPT-4o mini) on sizes $2\times3$ and $4\times5$ using the metrics $A_{\mathrm{puzzle}}$ (puzzle-level accuracy) and $A_{\mathrm{cell}}$ (cell-level accuracy). Results show that o3-mini generally outperforms GPT-4o mini and that red herrings increase difficulty, whereas the English vs. Danish comparison shows no significant language effect for o3-mini on $4\times5$ puzzles; no single clue type consistently predicts difficulty. Datasets with $128$ training and $1024$ testing puzzles per size are released in each language, along with generation code designed for extension to more languages and themes. The benchmark offers a scalable, multilingual framework for benchmarking logical reasoning in current and future LLMs.

Abstract

Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparing logical reasoning skills across several languages, at a difficulty suitable for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes and sizes, with 14 different clue types and 8 red herring types (uninformative clues). We find that puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by $15\pm7$ %. Scores of o3-mini on 4x5 puzzles are not significantly affected by the use of English vs. Danish or by the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability to more languages and themes.
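The two metrics named in the summary, puzzle-level accuracy $A_{\mathrm{puzzle}}$ and cell-level accuracy $A_{\mathrm{cell}}$, can be illustrated with a minimal sketch. It assumes that $A_{\mathrm{puzzle}}$ is the fraction of puzzles solved entirely correctly and that $A_{\mathrm{cell}}$ is the mean fraction of correctly filled cells per puzzle; the exact definitions are not stated here, and the function names, grid layout and toy data below are hypothetical rather than taken from the released code.

```python
from typing import Sequence, Tuple

# Assumed layout: a puzzle solution is a grid of strings (rows of cells).
Grid = Sequence[Sequence[str]]


def cell_accuracy(predicted: Grid, solution: Grid) -> float:
    """Fraction of cells in one puzzle that match the reference solution."""
    total = sum(len(row) for row in solution)
    correct = sum(
        p == s
        for pred_row, sol_row in zip(predicted, solution)
        for p, s in zip(pred_row, sol_row)
    )
    return correct / total


def benchmark_scores(
    predictions: Sequence[Grid], solutions: Sequence[Grid]
) -> Tuple[float, float]:
    """Return (A_puzzle, A_cell) under the assumed definitions above."""
    cell_scores = [cell_accuracy(p, s) for p, s in zip(predictions, solutions)]
    # A puzzle counts as solved only if every cell is correct.
    a_puzzle = sum(score == 1.0 for score in cell_scores) / len(cell_scores)
    # Cell-level accuracy is averaged over puzzles.
    a_cell = sum(cell_scores) / len(cell_scores)
    return a_puzzle, a_cell


# Toy data: two six-cell puzzles with the same reference solution.
solution = [["red", "cat", "tea"], ["blue", "dog", "milk"]]
perfect = [["red", "cat", "tea"], ["blue", "dog", "milk"]]
two_wrong = [["red", "dog", "tea"], ["blue", "cat", "milk"]]  # pets swapped

a_puzzle, a_cell = benchmark_scores([perfect, two_wrong], [solution, solution])
print(a_puzzle, a_cell)
```

Under these assumptions the second toy puzzle earns partial credit at cell level but none at puzzle level, which is why the two metrics can diverge on larger puzzles.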
