ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Vihaan Misra, Peter Schaldenbrand, Jean Oh

TL;DR

ShapeShift tackles semantic shape arrangement: it rearranges a fixed set of rigid geometric primitives while guiding their placement with diffusion priors. Each shape carries a minimal parameterization $P_i=(x_i, y_i, \alpha_i)$ (position and orientation), and the method combines differentiable vector-graphics rendering with multi-scale score distillation sampling (SDS) to align layouts with text prompts. A content-aware collision resolution step blends geometric corrections with semantic guidance. The approach yields interpretable, collision-free compositions and outperforms pixel-based baselines in both geometric validity and semantic coherence, as confirmed by ablations and human studies. By bridging high-level semantic understanding with explicit geometric constraints, the framework enables controllable, physically plausible designs for education, design, and human-computer interaction.
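
To make the three-parameter representation concrete, here is a minimal sketch of how a rigid shape could be placed from $P_i=(x_i, y_i, \alpha_i)$. The names `ShapeParams`, `place`, and `edge_lengths` are hypothetical, and the convention assumed here (rotate about the shape's local origin, then translate) may differ from the paper's. The sketch only illustrates that such a parameterization moves a shape without deforming it.

```python
import math
from dataclasses import dataclass


@dataclass
class ShapeParams:
    """Hypothetical container for one shape's pose (x, y, alpha)."""
    x: float      # horizontal position
    y: float      # vertical position
    alpha: float  # orientation in radians


def place(vertices, p):
    """Rotate local-frame vertices by p.alpha about the origin, then translate by (p.x, p.y)."""
    c, s = math.cos(p.alpha), math.sin(p.alpha)
    return [(p.x + c * vx - s * vy, p.y + s * vx + c * vy) for vx, vy in vertices]


def edge_lengths(vertices):
    """Lengths of the polygon's edges, used to check that the shape stays rigid."""
    n = len(vertices)
    return [math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n)]


# A right triangle in its own local frame, placed at (2, 3) and rotated 90 degrees.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
placed = place(triangle, ShapeParams(x=2.0, y=3.0, alpha=math.pi / 2))
```

Because only the pose changes, `edge_lengths(placed)` matches `edge_lengths(triangle)` up to floating-point error, which is the rigidity property the task relies on.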

Abstract

While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.
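
For orientation, score distillation sampling is usually written in the following form (the standard formulation from the text-to-3D literature, not necessarily the exact variant used here). The symbols are assumptions for illustration: $z$ is the rendered canvas (or its latent), $z_t$ its noised version at timestep $t$, $y$ the text prompt, $\hat{\epsilon}_\phi$ the pretrained denoiser, $w(t)$ a timestep weighting, and $P$ the shape parameters.

$$\nabla_{P}\,\mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial z}{\partial P} \right]$$

If the loss is averaged over $S$ rendering scales, as the multi-scale description suggests, the aggregate can be sketched as

$$\mathcal{L} = \frac{1}{S}\sum_{s=1}^{S} \mathcal{L}_{\mathrm{SDS}}^{(s)},$$

where $\mathcal{L}_{\mathrm{SDS}}^{(s)}$ is the SDS loss computed on the render at scale $s$. Since the renderer is differentiable, the gradient flows through $\partial z / \partial P$ directly to each shape's position and orientation.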

Paper Structure

This paper contains 20 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Various shape arrangements produced by Sh$\Delta$peShift. Given a goal concept and an input image containing a set of arbitrary shapes, the task is to generate an image of the same set of shapes rearranged to match the textual concept, e.g., "Horse," "Tree," or "Crown," without pieces overlapping each other.
  • Figure 2: A Lack of Physical Constraints in Image Generators. People cleverly use a small set of blocks to create semantically rich arrangements with Tangram puzzles. While Stable Diffusion [esser2024stable-diffusion3] generates semantically rich Tangram arrangements, these are invalid because they do not use the available blocks or contain physically impossible overlaps. This illustrates the gap between pixel-based generation and physically constrained arrangement tasks.
  • Figure 3: ShapeShift Overview. Our method iteratively optimizes the positions and orientations of a given arrangement of objects to match a given language description and obey physical constraints. The process begins by extracting shape parameters $P$ (position $x_k, y_k$ and orientation $\alpha_k$) from the input arrangement using SAM [ravi2024sam2segmentimages]. These parameters are used to render the current canvas via DiffVG. The goal concept is encoded through CLIP, enabling Multi-Scale Score Distillation Sampling that generates gradients to update the shape parameters ($\Delta P$). Simultaneously, our Content-Aware Collision Resolution Module uses GPT to identify semantic regions in the arrangement and applies geometric adjustments based on the Separating Axis Theorem (SAT) to ensure shapes remain physically valid while respecting semantic relationships.
  • Figure 4: Multi-Scale Rendering. (a) High-resolution render capturing fine details and precise edge alignments; (b) low-resolution render emphasizing global layout and overall shape placement. Rendering at multiple scales provides robust semantic guidance and avoids overfitting to pixel-level noise. We use an aggregated SDS loss, computed by averaging the losses across scales, so that both coarse and fine semantic features guide the optimization.
  • Figure 5: Content-Aware Collision Resolution Process for the goal concept "Sword". (a) Plain collision resolution relies solely on geometric penetration, often producing unnatural displacements. (b) The initial render state before collision detection. (c) Our content-aware method integrates extracted semantic concepts and dynamic attention weights to preserve object orientations and contextual relationships (e.g., maintaining a sword's natural alignment).
  • ...and 2 more figures
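
The Figure 3 caption mentions geometric adjustments based on the Separating Axis Theorem. As a rough illustration of that geometric ingredient alone, the sketch below tests two convex polygons for overlap and, when they overlap, returns a minimum translation vector that pushes the first one clear. It corresponds most closely to the "plain" resolution of Figure 5(a): the semantic weighting of the content-aware module is not modeled. The function names are hypothetical, and the sketch assumes convex polygons and ignores the case where one shape fully contains the other.

```python
import math


def _axes(poly):
    """Unit normals of every edge of a convex polygon (candidate separating axes)."""
    axes = []
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        axes.append((-ey / length, ex / length))
    return axes


def _project(poly, axis):
    """Interval covered by the polygon when projected onto the axis."""
    dots = [px * axis[0] + py * axis[1] for px, py in poly]
    return min(dots), max(dots)


def _centroid(poly):
    """Vertex average, used only to decide the push direction."""
    return (sum(p[0] for p in poly) / len(poly), sum(p[1] for p in poly) / len(poly))


def sat_mtv(a, b):
    """Return None if convex polygons a and b do not overlap; otherwise
    a translation (dx, dy) for `a` that separates them along the axis
    of smallest penetration."""
    best_depth, best_axis = math.inf, None
    for axis in _axes(a) + _axes(b):
        a_min, a_max = _project(a, axis)
        b_min, b_max = _project(b, axis)
        depth = min(a_max, b_max) - max(a_min, b_min)
        if depth <= 0:
            return None  # found a separating axis: no overlap
        if depth < best_depth:
            best_depth, best_axis = depth, axis
    # Orient the axis so that it points from b toward a.
    (ax, ay), (bx, by) = _centroid(a), _centroid(b)
    if (ax - bx) * best_axis[0] + (ay - by) * best_axis[1] < 0:
        best_axis = (-best_axis[0], -best_axis[1])
    return (best_axis[0] * best_depth, best_axis[1] * best_depth)


# Two axis-aligned squares overlapping by 0.5 horizontally.
square_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
square_b = [(1.5, 0.5), (3.5, 0.5), (3.5, 2.5), (1.5, 2.5)]
mtv = sat_mtv(square_a, square_b)
moved_a = [(x + mtv[0], y + mtv[1]) for x, y in square_a]
```

After applying the returned vector, `sat_mtv(moved_a, square_b)` reports no overlap. A content-aware variant would presumably blend this purely geometric correction with semantic guidance rather than apply it directly, as the Figure 5 caption describes.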