GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs
Kalliopi Basioti, Pritish Sahu, Qingze Tony Liu, Zihao Xu, Hao Wang, Vladimir Pavlovic
TL;DR
GenVP advances abstract visual reasoning (AVR) by modeling both the generation and the solving of Raven's Progressive Matrices (RPMs) within a hierarchical VAE (HVAE) framework augmented with a Mixture of Experts for robust rule inference. It introduces a dual contrastive learning scheme, global (cross-puzzle) and local (cross-candidate), to strengthen rule representation and generalization, and it can generate complete RPMs from abstract rules. Across five AVR datasets and challenging out-of-distribution scenarios, GenVP achieves state-of-the-art puzzle-solving accuracy and generalizes well to unseen attributes and large solution spaces, while also generating coherent, rule-consistent RPMs. The combination of generative capability, robust rule disentanglement, and scalable inference positions GenVP as a versatile tool for both RPM solving and AI creativity in puzzle design and high-level visual reasoning.
Abstract
Raven's Progressive Matrices (RPMs) are an established benchmark for examining the ability to perform high-level abstract visual reasoning (AVR). Despite the current success of algorithms that solve this task, humans can generalize beyond a given puzzle and create new puzzles given a set of rules, whereas machines remain locked into solving a fixed puzzle from a curated choice list. We propose Generative Visual Puzzles (GenVP), a framework to model the entire RPM generation process, a substantially more challenging task. Our model's capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles from a desired set of rules. Experiments on five different datasets indicate that GenVP achieves state-of-the-art (SOTA) performance both in puzzle-solving accuracy and in out-of-distribution (OOD) generalization across 22 OOD scenarios. Compared to SOTA generative approaches, which struggle to solve RPMs when the feasible solution space increases, GenVP efficiently generalizes to these challenging setups. Moreover, our model demonstrates the ability to produce a wide range of complete RPMs given a set of abstract rules by effectively capturing the relationships between abstract rules and visual object properties.
