VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps

Zhuoning Xu, Xinyan Liu

TL;DR

The paper tackles jigsaw puzzle solving with fragment gaps by leveraging semantic guidance from natural language. It introduces Vision-Language Hierarchical Semantic Alignment (VLHSA), which aligns visual patches with textual descriptions at token, region, and global levels using dual encoders (Vision Mamba and BLIP) and CLIP text features, with Hungarian assignment for final placement. Empirical results on JPwLEG-3 and JPwLEG-5 demonstrate state-of-the-art performance, notably a 14.2 percentage point gain in piece accuracy on JPwLEG-5 and a 19.0% Perfect reconstruction rate, with ablations confirming the crucial role of global alignment and the value of multimodal fusion. The work establishes a new multimodal paradigm for spatial reconstruction, showing that semantic language cues can significantly augment vision-only puzzle solvers in challenging, gap-filled scenarios.

Abstract

Jigsaw puzzle solving remains challenging in computer vision, requiring an understanding of both local fragment details and global spatial relationships. Most traditional approaches focus only on visual cues such as edge matching and visual coherence; few explore natural language descriptions as semantic guidance in challenging scenarios, especially puzzles with eroded gaps. We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module, which aligns visual patches with textual descriptions through multi-level semantic matching from local tokens to global context. The module also integrates a multimodal architecture that combines dual visual encoders with language features for cross-modal reasoning. Experiments demonstrate that our method significantly outperforms state-of-the-art models across various datasets, with improvements including a 14.2 percentage point gain in piece accuracy. Ablation studies confirm the critical role of the VLHSA module in driving improvements over vision-only approaches. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
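The reported gains are stated in terms of piece accuracy and a Perfect reconstruction rate. The page does not give formal definitions, so the sketch below assumes the common ones: piece accuracy is the fraction of pieces placed in their ground-truth slot, and Perfect is the fraction of puzzles with every piece placed correctly. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List

Placement = List[int]  # placement[i] = slot index assigned to piece i


def piece_accuracy(pred: List[Placement], truth: List[Placement]) -> float:
    """Fraction of all pieces (pooled over puzzles) placed in the correct slot."""
    correct = sum(
        p == t
        for pred_one, truth_one in zip(pred, truth)
        for p, t in zip(pred_one, truth_one)
    )
    total = sum(len(truth_one) for truth_one in truth)
    return correct / total


def perfect_rate(pred: List[Placement], truth: List[Placement]) -> float:
    """Fraction of puzzles whose whole placement matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def gain_in_percentage_points(new: float, old: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (new - old) * 100.0


# Toy data (made up): two 4-piece puzzles, the second with two swapped pieces.
truth = [[0, 1, 2, 3], [0, 1, 2, 3]]
pred = [[0, 1, 2, 3], [1, 0, 2, 3]]

acc = piece_accuracy(pred, truth)   # 6 of 8 pieces correct -> 0.75
perfect = perfect_rate(pred, truth)  # 1 of 2 puzzles perfect -> 0.5
```

A gain quoted in percentage points is an absolute difference between two rates: moving from 0.50 to 0.75 is a gain of 25 percentage points, which is a 50% relative improvement.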

Paper Structure

This paper contains 36 sections, 16 equations, 3 figures, 9 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Overall framework structure of VLHSA for solving jigsaw puzzles with fragment gaps. The model integrates visual features from Vision Mamba and BLIP, while CLIP captures semantic information from captions to guide the alignment. Hierarchical correspondence between visual and textual features is established at the token, region, and global levels. These fused multimodal representations are used to predict piece positions, with the optimal assignment determined by the Hungarian algorithm.
  • Figure 2: Qualitative results on the JPwLEG-5 dataset. The top row shows the scrambled input, the middle row the ground truth, and the bottom row our reconstruction. Green boxes mark correctly placed pieces; red boxes mark errors.
  • Figure 3: Qualitative results on the JPwLEG-3 dataset. Green boxes mark correctly placed pieces; red boxes mark errors.
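The Figure 1 caption says that the predicted piece positions are turned into a final placement by the Hungarian algorithm. The sketch below illustrates that last step only, under stated assumptions: the model is taken to output a square score matrix in which `scores[i][j]` is the confidence that piece `i` belongs in slot `j` (this format is an assumption, not something the page specifies), and the names `hungarian_min_cost` and `assign_pieces` are hypothetical. It is a plain-Python version of the standard O(n³) potentials formulation, not the authors' code; in practice a library routine such as SciPy's `linear_sum_assignment` is the usual choice.

```python
from typing import List


def hungarian_min_cost(cost: List[List[float]]) -> List[int]:
    """Minimum-cost one-to-one assignment of rows to columns (rows <= columns).

    Returns col_of_row, where col_of_row[i] is the column given to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)  # row potentials (1-based)
    v = [0.0] * (m + 1)  # column potentials (1-based)
    p = [0] * (m + 1)    # p[j]: row currently matched to column j, 0 if free
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while j0 != 0:      # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            col_of_row[p[j] - 1] = j - 1
    return col_of_row


def assign_pieces(scores: List[List[float]]) -> List[int]:
    """Pick the placement that maximizes total score by minimizing negated scores."""
    cost = [[-s for s in row] for row in scores]
    return hungarian_min_cost(cost)


# Toy scores (made up): pieces 0 and 1 both prefer slot 0.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.10, 0.20, 0.70],
]
placement = assign_pieces(scores)  # [1, 0, 2]
```

In the toy matrix, choosing each piece's best slot independently would put pieces 0 and 1 in the same slot. The assignment step resolves the conflict globally: piece 0 gives up slot 0 for slot 1 because that costs only 0.10 in score, while forcing piece 1 out of slot 0 would cost 0.75.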