Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
TL;DR
The paper tackles the generation of abstract visual compositions by arranging fixed geometric primitives under hard geometric constraints, guided by weak textual prompts. It introduces Generative Adversarial Gumbel MCTS (GAG MCTS), a constraint-aware search framework paired with a vision-language reward model and adversarial reward refinement to ensure both feasibility and semantic alignment. On Tangram Assembly and a rectangle packing variant, the method consistently outperforms diffusion, autoregressive, and PPO baselines, with larger gains as constraints tighten. This approach enables reliable, data-efficient abstract visual synthesis where pixel-space generators struggle, and highlights the value of explicit constraint enforcement coupled with semantic verification.
Abstract
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations (e.g., parts, symmetry, topology) among a small set of geometric primitives. Such compositions are largely invariant to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and a vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility requirements (overlap-free placement, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment and provides the reward signal. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network on search-generated plans. Inspired by Generative Adversarial Networks, we use the generated instances for adversarial reward refinement. Over time, the generated instances should approach the real data, to the point where the reward model cannot distinguish them from ground truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and autoregressive baselines, especially as constraints tighten.
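The abstract does not give the action space or the search details, so the following is only a rough illustration of two ingredients it names: hard feasibility checks (overlap-free placement, allowable orientations) and Gumbel-based sampling of candidate actions. It assumes a grid-based rectangle packing state, 0/90 degree rotations as the only allowed orientations, and uniform logits standing in for a policy network. The function names `feasible_placements` and `gumbel_top_k` are hypothetical and are not taken from the paper.

```python
import math
import random


def feasible_placements(grid, w, h):
    """List every (row, col, rotated) placement of a w x h rectangle that
    stays inside the grid and covers no occupied cell.

    Assumption: the state is a 2D occupancy grid (True = occupied) and the
    only allowed orientations are the original one and a 90 degree rotation.
    """
    rows, cols = len(grid), len(grid[0])
    orientations = [(h, w, False)]
    if w != h:  # a square looks the same after rotation, so skip the duplicate
        orientations.append((w, h, True))

    placements = []
    for ph, pw, rotated in orientations:
        for r in range(rows - ph + 1):
            for c in range(cols - pw + 1):
                covered = (grid[r + dr][c + dc]
                           for dr in range(ph) for dc in range(pw))
                if not any(covered):
                    placements.append((r, c, rotated))
    return placements


def gumbel_top_k(logits, k, rng):
    """Sample up to k distinct indices without replacement by adding
    Gumbel(0, 1) noise to each logit and keeping the k largest."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        perturbed.append(logit - math.log(-math.log(u)))
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


# Usage: a 3 x 3 grid with one occupied corner and a 2 x 1 piece to place.
grid = [[False] * 3 for _ in range(3)]
grid[0][0] = True

actions = feasible_placements(grid, w=2, h=1)
logits = [0.0] * len(actions)  # placeholder for policy network outputs
rng = random.Random(0)
candidates = [actions[i] for i in gumbel_top_k(logits, k=3, rng=rng)]
```

Because infeasible placements are never enumerated, every candidate handed to the search is valid by construction. In a full system the candidates would then be evaluated by tree search and scored by the reward model, which this sketch leaves out.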
