Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them
Gur Elkin, Ofir Itzhak Shahar, Ohad Ben-Shahar
TL;DR
The paper reassembles square jigsaw puzzles without visual input by recasting the task as a sequence-to-sequence prediction problem built on a novel puzzle tokenizer. Each piece is encoded as a short border-oriented token sequence, and an encoder–decoder Transformer autoregressively predicts a permutation $\hat{Y}$ of piece placements, trained with cross-entropy against the ground-truth permutation $Y$. Performance is measured by the fraction of correctly placed pieces, $\mathrm{Abs}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^N \mathbb{1}[y_i=\hat{y}_i]$, and by whether every piece is correctly placed, $\mathrm{Perfect}(Y,\hat{Y})=\prod_{i=1}^N \mathbb{1}[y_i=\hat{y}_i]$. Across ImageNet 3×3, JPwLEG, and missing-piece variants, the approach achieves state-of-the-art or competitive results without using raw images, and sometimes surpasses vision-based solvers. Analyses of token distributions reveal structured, partially Zipfian patterns and high token diversity, supporting the viability of language-driven puzzle solving and suggesting broader cross-domain applications.
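The two evaluation metrics above can be illustrated with a minimal Python sketch. It assumes a placement is represented as a list of integer position indices, one per piece; that representation and the function names `abs_accuracy` and `perfect` are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def abs_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of pieces placed at their ground-truth position (the Abs metric)."""
    if len(y_true) != len(y_pred):
        raise ValueError("ground truth and prediction must have the same length")
    if len(y_true) == 0:
        raise ValueError("a puzzle must contain at least one piece")
    # Mean of the indicator 1[y_i == y_hat_i] over all N pieces.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def perfect(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """1 if every piece is placed correctly, else 0 (the Perfect metric)."""
    if len(y_true) != len(y_pred):
        raise ValueError("ground truth and prediction must have the same length")
    # Product of indicators: a single mismatch makes the whole product 0.
    return int(all(t == p for t, p in zip(y_true, y_pred)))


# Hypothetical 3x3 puzzle (N = 9) in which two pieces are swapped.
truth = [0, 1, 2, 3, 4, 5, 6, 7, 8]
guess = [0, 1, 2, 3, 5, 4, 6, 7, 8]
score = abs_accuracy(truth, guess)   # 7 of 9 pieces match
solved = perfect(truth, guess)       # 0, because the puzzle is not fully solved
```

Because Perfect is a product of indicators, it is far stricter than Abs: a single misplaced piece drops it to 0, while Abs only decreases by $1/N$ per misplaced piece.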
Abstract
Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.
