VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps
Zhuoning Xu, Xinyan Liu
TL;DR
The paper tackles jigsaw puzzle solving with gaps between fragments by leveraging semantic guidance from natural language. It introduces Vision-Language Hierarchical Semantic Alignment (VLHSA), which aligns visual patches with textual descriptions at the token, region, and global levels using dual visual encoders (Vision Mamba and BLIP) and CLIP text features, with Hungarian assignment for final placement. Empirical results on JPwLEG-3 and JPwLEG-5 demonstrate state-of-the-art performance, notably a 14.2 percentage point gain in piece accuracy on JPwLEG-5 and a 19.0% Perfect reconstruction rate. Ablations confirm the crucial role of global alignment and the value of multimodal fusion. The work establishes a new multimodal paradigm for spatial reconstruction, showing that semantic language cues can significantly augment vision-only puzzle solvers in challenging eroded-gap scenarios.
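The final placement step mentioned above can be illustrated with a small, self-contained sketch of Hungarian assignment. It assumes a square score matrix in which scores[i][j] is the model's confidence that piece i belongs at position j; the function names (hungarian, place_pieces) and the example scores are hypothetical and not taken from the paper. The solver is the standard O(n^3) potentials-based formulation of the Hungarian algorithm, written in plain Python for clarity rather than speed.

```python
def hungarian(cost):
    """Solve the square assignment problem, minimizing total cost.

    Returns a list where result[i] is the column assigned to row i.
    Standard O(n^3) Hungarian algorithm with row/column potentials.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-indexed)
    v = [0.0] * (n + 1)      # column potentials (1-indexed)
    p = [0] * (n + 1)        # p[j] = row matched to column j (0 = unmatched)
    way = [0] * (n + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the root.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def place_pieces(scores):
    """Pick one position per piece so that the total score is maximal.

    scores[i][j] is an assumed piece-to-position confidence; negating it
    turns the maximization into the minimization that hungarian() solves.
    """
    cost = [[-s for s in row] for row in scores]
    return hungarian(cost)


# Hypothetical 2-piece example where greedy placement is suboptimal:
# piece 0 prefers position 0, but giving it position 1 lets piece 1
# take position 0 for a higher total score (0.8 + 0.85 vs 0.9 + 0.1).
scores = [[0.9, 0.8],
          [0.85, 0.1]]
placement = place_pieces(scores)
```

The point of a global assignment, as opposed to letting each piece pick its best position independently, is that every position is used exactly once and the total score is optimized jointly.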
Abstract
Jigsaw puzzle solving remains challenging in computer vision because it requires understanding both local fragment details and global spatial relationships. Most traditional approaches focus only on visual cues such as edge matching and visual coherence; few explore natural language descriptions as semantic guidance, especially for puzzles with eroded gaps. We propose a vision-language framework that leverages textual context to improve puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module, which aligns visual patches with textual descriptions through multi-level semantic matching, from local tokens to global context. The module also integrates a multimodal architecture that combines dual visual encoders with language features for cross-modal reasoning. Experiments show that our method significantly outperforms state-of-the-art models across various datasets, including a 14.2 percentage point gain in piece accuracy. Ablation studies confirm the critical role of the VLHSA module in driving improvements over vision-only approaches. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
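To make the idea of multi-level semantic matching concrete, the following is a minimal sketch of how token-level, region-level, and global similarity between patch features and text features could be combined into one score. The abstract does not specify the scoring functions, so everything here is an assumption made for illustration: the use of cosine similarity, the max-over-patches token matching, the fixed-size region pooling, the equal weights, and all function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def token_score(patch_tokens, text_tokens):
    """Token level (assumed): match each text token to its most similar
    patch token, then average over text tokens."""
    best = [max(cosine(t, p) for p in patch_tokens) for t in text_tokens]
    return sum(best) / len(best)


def region_score(patch_tokens, text_tokens, region_size=2):
    """Region level (assumed): mean-pool consecutive groups of patch tokens
    and take the best match against the pooled text feature."""
    regions = [mean_pool(patch_tokens[i:i + region_size])
               for i in range(0, len(patch_tokens), region_size)]
    text_vec = mean_pool(text_tokens)
    return max(cosine(r, text_vec) for r in regions)


def global_score(patch_tokens, text_tokens):
    """Global level (assumed): compare pooled image and pooled text features."""
    return cosine(mean_pool(patch_tokens), mean_pool(text_tokens))


def hierarchical_alignment(patch_tokens, text_tokens,
                           weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three levels; the weights are illustrative."""
    scores = (token_score(patch_tokens, text_tokens),
              region_score(patch_tokens, text_tokens),
              global_score(patch_tokens, text_tokens))
    return sum(w * s for w, s in zip(weights, scores))


# Toy 2-D features: two patch tokens and a single text token.
patches = [[1.0, 0.0], [0.0, 1.0]]
matching_text = [[1.0, 0.0]]
opposed_text = [[-1.0, 0.0]]
aligned = hierarchical_alignment(patches, matching_text)
misaligned = hierarchical_alignment(patches, opposed_text)
```

In this toy setup the description that points in the same direction as one of the patch tokens receives a higher combined score than the opposed one, which is the behavior a hierarchical alignment score is meant to capture.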
