Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee

TL;DR

The paper tackles the generation of abstract visual compositions: arranging a fixed set of geometric primitives under hard geometric constraints, guided only by weak textual prompts. It introduces Generative Adversarial Gumbel MCTS (GAG MCTS), a constraint-aware search framework paired with a vision-language reward model and adversarial reward refinement, so that outputs are both feasible and semantically aligned. On Tangram Assembly and a rectangle packing variant, the method consistently outperforms diffusion, autoregressive, and PPO baselines, with larger gains as constraints tighten. The approach enables reliable, data-efficient abstract visual synthesis where pixel-space generators struggle, and highlights the value of coupling explicit constraint enforcement with semantic verification.

Abstract

We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). Such compositions are largely invariant to texture and photorealistic detail. Composing these structures from fixed components under geometric constraints and a vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility requirements (overlap-free placement, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment and provides the reward signal. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network on search-generated plans. Inspired by Generative Adversarial Networks, we use the generated instances for adversarial reward refinement. Over time, the generated instances should approach the actual data more closely, until the reward model can no longer distinguish them from the ground truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
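To make the idea of a policy-guided, constraint-aware search step more concrete, the following is a minimal sketch of feasibility-masked Gumbel-top-k sampling, the standard trick used in Gumbel-style tree search to pick several distinct candidate actions at a node. It is an illustration under stated assumptions, not the authors' implementation: the function name `gumbel_top_k`, the boolean `feasible` mask, and the example logits are all hypothetical.

```python
import math
import random


def gumbel_top_k(logits, feasible, k, rng):
    """Sample up to k distinct feasible actions without replacement.

    Adding independent Gumbel(0, 1) noise to the logits and keeping the k
    largest perturbed values is equivalent to sampling k actions without
    replacement from softmax(logits). Infeasible actions are skipped, so
    they can never be selected.
    """
    perturbed = []
    for action, (logit, ok) in enumerate(zip(logits, feasible)):
        if not ok:
            continue  # hard constraint: the action is never considered
        # random() returns values in [0, 1); clamp away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel, action))
    perturbed.sort(reverse=True)
    return [action for _, action in perturbed[:k]]


# Hypothetical policy logits for five placements; two are infeasible
# (for example, they would overlap an already placed piece).
rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 1.5, 0.0]
feasible = [True, False, True, True, False]
candidates = gumbel_top_k(logits, feasible, k=2, rng=rng)
```

Masking before sampling, rather than penalizing infeasible actions in the reward, means the search never spends simulations on invalid placements. This is one plausible reading of "search enforces feasibility" in the abstract.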

Paper Structure

This paper contains 47 sections, 10 equations, 6 figures, 4 tables, 3 algorithms.

Figures (6)

  • Figure 1: The Tangram assembly task. The seven pieces are placed on the board to form the target shape, described by a text prompt such as "perched bird". We show that our GAG MCTS can generate semantically aligned abstract visual concepts under hard constraints and limited data, while diffusion models perform poorly.
  • Figure 2: Illustration of the actions in Tangram assembly. (a) Pieces can only be placed when their anchor points align; (b) rotations are restricted to multiples of 45 degrees.
  • Figure 3: Overview of the proposed Generative Adversarial Gumbel MCTS. The black arrows indicate output data from the modules, and the purple arrows indicate gradient descent training.
  • Figure 4: The policy-value network. We use a pre-trained Vision Transformer (ViT) and BERT, together with a transformer that fuses the vision and language features, to form the main body of the network. We use a single network to predict both the value and the next action, as suggested by the AlphaZero paper. The last two tokens are decoded into a value and action logits by a linear layer.
  • Figure 5: Examples from the dataset and generated tangram configurations. We compare the configurations generated by GAG MCTS with those of the main baseline methods.
  • ...and 1 more figure
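
The action space described in the Figure 2 caption (anchor alignment plus rotations in multiples of 45 degrees) can be sketched as follows. This is a hedged illustration only: it assumes that anchor points are polygon vertices, which the caption does not state, and the names `rotate`, `place`, and `enumerate_placements` are hypothetical. The overlap check needed for full feasibility is deliberately left out.

```python
import math


def rotate(points, steps):
    """Rotate points about the origin by steps * 45 degrees, counter-clockwise."""
    angle = math.radians(45 * (steps % 8))
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def place(piece, steps, piece_anchor, board_anchor):
    """Rotate a piece, then translate it so one vertex lands on board_anchor.

    Assumption: the anchor of a piece is one of its vertices, selected by
    the index piece_anchor.
    """
    rotated = rotate(piece, steps)
    ax, ay = rotated[piece_anchor]
    dx, dy = board_anchor[0] - ax, board_anchor[1] - ay
    return [(x + dx, y + dy) for x, y in rotated]


def enumerate_placements(piece, board_anchors):
    """Yield (steps, vertex index, board anchor, placed vertices) candidates."""
    for steps in range(8):  # 8 rotations: 0, 45, ..., 315 degrees
        for i in range(len(piece)):
            for anchor in board_anchors:
                yield steps, i, anchor, place(piece, steps, i, anchor)


# A unit right triangle, rotated by 90 degrees (2 steps) and attached by
# its second vertex to the board anchor (3, 3).
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
placed = place(triangle, steps=2, piece_anchor=1, board_anchor=(3.0, 3.0))
all_candidates = list(enumerate_placements(triangle, [(0.0, 0.0), (3.0, 3.0)]))
```

Under these assumptions the candidate set for one piece is finite (8 rotations times the number of piece vertices times the number of board anchors), which is what makes a discrete tree search over placements possible. A real implementation would additionally filter out candidates that overlap pieces already on the board.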