Teaching an Agent to Sketch One Part at a Time

Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich

Abstract

We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning procedure following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enable interpretable, controllable, and locally editable text-to-vector sketch generation.

Paper Structure

This paper contains 23 sections, 10 equations, 25 figures, 4 tables, and 1 algorithm.

Figures (25)

  • Figure 1: Progressive vector sketch generation using our VLM agent. Trained on our new dataset via SFT + RL training, our agent generates sketches part-by-part, conditioned on text instructions and the evolving canvas. It produces diverse, structurally plausible sketches and supports localized editing via arbitrary stroke removal and replacement.
  • Figure 2: An illustration of our automated part annotation pipeline. The same VLM is used to produce part designations and assignments in some stages and to critique these assignments and suggest improvements in other stages. Green check marks indicate outputs retained in the final dataset.
  • Figure 3: Examples from the ControlSketch-Part dataset. We show part decompositions for 4 sketches with various objects and numbers of parts. The actual caption and part descriptions are shown for the rightmost sketch. The black text is the overall caption. The color-coded part descriptions and stroke groups demonstrate the part-level semantic annotations.
  • Figure 4: A visualization of the training pipeline. The task of generating vector sketches based on text prompts is split into multiple turns. Blue arrows: sequential computation; red arrows: loss. Cross-entropy loss and DreamSim reward are used as the training signal in the SFT and RL stages, respectively. $\pi_\theta$ is the policy model, i.e., our VLM.
  • Figure 5: The Long-CLIP cosine similarity across all tested models. The Ground Truth (GT) value and the Random value are the cosine similarity scores between the text and, respectively, the ground truth sketches from ControlSketch-Part and sketches of randomly sampled paths.
  • ...and 20 more figures
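
The metric named in the Figure 5 caption is a cosine similarity between a text embedding and a sketch embedding. As a minimal sketch of that computation only, the snippet below uses hand-written toy vectors in place of real Long-CLIP embeddings; the function name `cosine_similarity` and all vector values are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: a real evaluation would obtain
# these from a text encoder and an image encoder, not write them by hand).
text_embedding = [0.2, 0.9, 0.1]
sketch_embedding = [0.25, 0.8, 0.05]   # stands in for a rendered sketch
random_embedding = [0.9, -0.1, 0.4]    # stands in for randomly sampled paths

score_sketch = cosine_similarity(text_embedding, sketch_embedding)
score_random = cosine_similarity(text_embedding, random_embedding)
print(f"text vs. sketch: {score_sketch:.3f}")
print(f"text vs. random: {score_random:.3f}")
```

In this toy setup the sketch embedding points in nearly the same direction as the text embedding, so it scores higher than the random one. This mirrors how the GT and Random values in the caption serve as reference points for the tested models.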