Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them

Gur Elkin, Ofir Itzhak Shahar, Ohad Ben-Shahar

TL;DR

The paper addresses reassembling square jigsaw puzzles without visual input by recasting the task as a sequence-to-sequence prediction problem using a novel puzzle tokenizer. It encodes each piece into a short border-oriented token sequence and uses an encoder–decoder Transformer to autoregressively predict a permutation $\hat{Y}$ of piece placements. The model is optimized with cross-entropy against the ground-truth permutation $Y$ and evaluated with two metrics: the per-piece accuracy $\mathrm{Abs}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^N \mathbb{1}[y_i=\hat{y}_i]$ and the whole-puzzle accuracy $\mathrm{Perfect}(Y,\hat{Y})=\prod_{i=1}^N \mathbb{1}[y_i=\hat{y}_i]$. Across ImageNet 3×3, JPwLEG, and missing-piece variants, the approach achieves state-of-the-art or competitive results without using raw images, and sometimes surpasses vision-based solvers. Analyses of token distributions reveal structured, partially Zipfian patterns and high token diversity, supporting the viability of language-driven puzzle solving and suggesting broader cross-domain applications.
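The two evaluation metrics can be illustrated with a minimal Python sketch. The function names `abs_accuracy` and `perfect_accuracy` are hypothetical, and placements are assumed to be given as equal-length sequences of position indices; this is not the authors' evaluation code.

```python
from typing import Sequence


def abs_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of pieces placed at their correct position (the Abs metric)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def perfect_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """1 if every piece is placed correctly, else 0 (the Perfect metric)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    # A product of indicator values equals 1 only when all indicators are 1.
    return int(all(t == p for t, p in zip(y_true, y_pred)))


# Example: a 2x2 puzzle in which the last two pieces are swapped.
y = [0, 1, 2, 3]
y_hat = [0, 1, 3, 2]
print(abs_accuracy(y, y_hat))      # 0.5
print(perfect_accuracy(y, y_hat))  # 0
```

Because Perfect is a product of indicators, a single misplaced piece drives it to zero, which makes it a much stricter measure than Abs.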

Abstract

Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.

Paper Structure

This paper contains 21 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of our approach. By tokenizing the puzzle as a discrete sequence, we create a buffer between pictorial features and the language model's learned embeddings, guiding reassembly without accessing the raw image data. To the best of our knowledge, no previous method has explored the possibility of such "blind" reconstruction.
  • Figure 2: Tokenization Process. Each of the $N$ shuffled pieces is divided into $T \times T$ patches (shown above for $N=4$, $T=4$). Next, all patches are projected to a lower-dimensional space via a PCA matrix. We then associate each patch with the index of its nearest centroid (through $k$-means clustering). Lastly, for each piece we retain only the $\tau = 4(T-1)$ patches that lie on its border, chaining them clockwise into a super-token. The puzzle is then represented as the concatenation of all $N$ super-tokens.
  • Figure 3: Average per-puzzle Shannon entropy scores. The tokenized pieces exhibit lower entropy compared to a uniformly random sequence of the same length, implying an inherent structure to the data.
  • Figure 4: Zipf's law for tokenized puzzles. While the frequencies of most tokens are proportional to 1/Rank, the least frequent tokens are much scarcer than the law predicts.
  • Figure 5: Heaps' law for tokenized puzzles. We observe a steeper curve than the theoretical line of $n^{0.5}$, suggesting a higher token diversity.
  • ...and 3 more figures
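
The tokenization steps described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the function names are hypothetical, the PCA projection matrix `pca` and the $k$-means `centroids` are assumed to be fitted beforehand, mean-centering before the projection is omitted, the clockwise chain is assumed to start at the top-left patch, and only border patches are quantized (which yields the same tokens as quantizing all patches and then keeping the border).

```python
import numpy as np


def border_coords_clockwise(T: int) -> list[tuple[int, int]]:
    """(row, col) of the 4(T-1) border cells of a T x T grid, clockwise from the top-left."""
    if T < 2:
        raise ValueError("T must be at least 2")
    top = [(0, c) for c in range(T - 1)]
    right = [(r, T - 1) for r in range(T - 1)]
    bottom = [(T - 1, c) for c in range(T - 1, 0, -1)]
    left = [(r, 0) for r in range(T - 1, 0, -1)]
    return top + right + bottom + left


def tokenize_piece(piece: np.ndarray, T: int, pca: np.ndarray,
                   centroids: np.ndarray) -> list[int]:
    """Map one piece of shape (H, W, C) to its super-token of 4(T-1) centroid indices.

    pca:       (patch_dim, d) projection matrix (assumed pre-fitted).
    centroids: (K, d) k-means centroids in the projected space (assumed pre-fitted).
    """
    H, W = piece.shape[:2]
    ph, pw = H // T, W // T  # patch height and width; H and W assumed divisible by T
    tokens = []
    for r, c in border_coords_clockwise(T):
        patch = piece[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        z = patch.reshape(-1).astype(float) @ pca          # project to d dimensions
        dists = np.linalg.norm(centroids - z, axis=1)      # distance to each centroid
        tokens.append(int(np.argmin(dists)))               # index of nearest centroid
    return tokens


def tokenize_puzzle(pieces: list[np.ndarray], T: int, pca: np.ndarray,
                    centroids: np.ndarray) -> list[int]:
    """Concatenate the super-tokens of all N pieces into one sequence."""
    return [tok for p in pieces for tok in tokenize_piece(p, T, pca, centroids)]


# Usage with random stand-in data: N=4 pieces of 8x8x3 pixels, T=4, d=3, K=5.
rng = np.random.default_rng(0)
pieces = [rng.random((8, 8, 3)) for _ in range(4)]
pca = rng.standard_normal((2 * 2 * 3, 3))
centroids = rng.standard_normal((5, 3))
sequence = tokenize_puzzle(pieces, 4, pca, centroids)
print(len(sequence))  # 48, i.e. N * 4(T-1)
```

Corner patches are counted once each, which is why the border contains $4(T-1)$ patches rather than $4T$.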