AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process

Junjie Hu, Shuyong Gao, Qianyu Guo, Yan Wang, Qishan Wang, Yuang Feng, Wenqiang Zhang

TL;DR

AnimatePainter tackles the challenge of generating painting processes from arbitrary images without real drawing data by reframing the task as video generation. It combines a self-supervised data synthesis pipeline with a depth-guided, diffusion-based video generator, enhanced by a DF-Encoder that injects hierarchical depth information into cross-attention. Key contributions include a scalable self-supervised data generation method, depth-guided layering for process planning, and an end-to-end painting generator that produces coherent, human-like painting sequences. The approach yields realistic process videos across painting styles and demonstrates strong performance against baselines, offering a practical pathway for education, robotic painting, and creative AI applications where real process data is scarce.

Abstract

Humans can intuitively decompose an image into a sequence of strokes to create a painting, yet existing methods for generating drawing processes are limited to specific data types and often rely on expensive human-annotated datasets. We propose a novel self-supervised framework for generating drawing processes from any type of image, treating the task as a video generation problem. Our approach reverses the drawing process by progressively removing strokes from a reference image, simulating a human-like creation sequence. Crucially, our method does not require costly datasets of real human drawing processes; instead, we leverage depth estimation and stroke rendering to construct a self-supervised dataset. We model human drawings as "refinement" and "layering" processes and introduce depth fusion layers to enable video generation models to learn and replicate human drawing behavior. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to generate realistic drawings without the need for real drawing process data.
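The "reverse by removing strokes" idea in the abstract can be illustrated with a toy sketch. It is not the authors' pipeline: the stroke representation (a boolean mask plus a gray value), the white background, and the function names `render` and `painting_frames_by_removal` are all assumptions made for illustration. The sketch starts from the full stroke rendering, drops one stroke per step, and then reverses the resulting sequence so it reads as a forward painting process.

```python
import numpy as np

def render(strokes, shape, background=1.0):
    """Paint strokes in order onto a blank canvas (hypothetical stroke format)."""
    canvas = np.full(shape, background, dtype=np.float32)
    for mask, value in strokes:
        canvas[mask] = value  # later strokes overwrite earlier ones
    return canvas

def painting_frames_by_removal(strokes, shape, background=1.0):
    """Build frames by removing strokes from the finished image, then reverse."""
    # k runs from all strokes down to none: the "un-painting" sequence.
    removal_sequence = [
        render(strokes[:k], shape, background)
        for k in range(len(strokes), -1, -1)
    ]
    # Reversing the removal sequence gives blank canvas -> finished image.
    return removal_sequence[::-1]

# Two toy strokes on a 4x4 grayscale canvas.
shape = (4, 4)
mask_a = np.zeros(shape, dtype=bool)
mask_a[:2, :] = True
mask_b = np.zeros(shape, dtype=bool)
mask_b[1:3, 1:3] = True
strokes = [(mask_a, 0.5), (mask_b, 0.0)]

frames = painting_frames_by_removal(strokes, shape)
```

Here `frames[0]` is the empty canvas and `frames[-1]` is the full stroke rendering; a real pipeline would use an SBR model to produce the strokes rather than hand-made masks.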

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Our self-supervised method does not require expensive real painting process data, can quickly synthesize a large amount of painting data, and supports any SBR backbone.
  • Figure 2: Samples of dataset. Some key frames of the generated data. The first two rows are based on Frida (ink painting and colorful painting); the third row is based on PaintTransformer.
  • Figure 3: Pipeline of AnimatePainter. The input image $I$ is used in two ways: (1) generating a depth map with a depth estimation model for hierarchical control; (2) rendering into strokes through the SBR model. In the layering stage, the depth map is layered according to manual rules to obtain ${\{D_t\}}_{t=1}^T$. Similarly, a layered image sequence ${\{M_t\}}_{t=1}^T$ is generated based on the painting rules of human artists. After being encoded by the VAE, ${\{D_t\}}_{t=1}^T$ enters the DF-Encoder to inject hierarchical control into the diffusion model. The entire painting process is generated end to end in one step. The stroke image $I_s$ is generated using PaintTransformer as the backbone.
  • Figure 4: Samples of the evaluation dataset, including oil paintings, landscape images, portraits, and more.
  • Figure 5: Generated Painting Process. At the beginning of painting, AnimatePainter fills in the basic background color and then focuses on depicting the rough outline of the object, generating only relatively blurry content at this stage. Subsequently, the model begins to refine the details and enhance the sharpness of the painting. Meanwhile, the model adds details from far to near during the painting process, which aligns with our previous discussion.
  • ...and 3 more figures
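
The depth layering mentioned in the Figure 3 and Figure 5 captions (a sequence ${\{D_t\}}_{t=1}^T$ that reveals content from far to near) might be sketched as follows. The captions only say the layering follows "manual rules", so the quantile-based thresholds here are an assumption, as are the convention that larger depth values mean farther away and the function name `layer_depth`.

```python
import numpy as np

def layer_depth(depth, num_layers):
    """Return cumulative boolean masks that grow from far regions to near ones.

    Assumptions (not from the paper): larger depth = farther from the camera,
    and layer boundaries are equal-population quantiles of the depth map.
    """
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1))
    masks = []
    for t in range(1, num_layers + 1):
        # Lower the threshold each step so nearer pixels are added over time.
        threshold = edges[num_layers - t]
        masks.append(depth >= threshold)
    return masks

# Toy depth map with values 0..15 on a 4x4 grid.
depth = np.arange(16, dtype=np.float32).reshape(4, 4)
masks = layer_depth(depth, num_layers=4)
```

Each mask contains the previous one, and the last mask covers the whole image, so the masks can gate which parts of the stroke rendering are visible at each stage.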