Inverse Painting: Reconstructing The Painting Process

Bowei Chen; Yifan Wang; Brian Curless; Ira Kemelmacher-Shlizerman; Steven M. Seitz

TL;DR

This work reconstructs a time-lapse video of how an input painting may have been painted. It frames the task as an autoregressive image generation problem, in which an initially blank "canvas" is iteratively updated by a novel diffusion-based renderer.

Abstract

Given an input painting, we reconstruct a time-lapse video of how it may have been painted. We formulate this as an autoregressive image generation problem, in which an initially blank "canvas" is iteratively updated. The model learns from real artists by training on many painting videos. Our approach incorporates text and region understanding to define a set of painting "instructions" and updates the canvas with a novel diffusion-based renderer. The method extrapolates beyond the limited, acrylic style paintings on which it has been trained, showing plausible results for a wide range of artistic styles and genres.

Paper Structure (13 sections, 3 equations, 11 figures, 1 table)

Introduction
Related Work
Painting Process Generation
Diffusion Models
Our Method
One-Stage Canvas Rendering Approach
Training: Instruction Generation
Text Instruction Generator
Mask Instruction Generator
Training: Canvas Rendering
Test-Time Generation
Experiments
Discussions

Figures (11)

Figure 1: We present Inverse Painting, a diffusion-based method to generate time-lapse videos of the painting process from a target painting. This figure shows 10 keyframes from the generated painting process for two paintings. By training on acrylic paintings with a similar artistic style to that of the first example in this figure, our method is capable of handling a diverse range of styles (e.g., Van Gogh, bottom). The resulting videos resemble how human artists typically paint, for example, from back to front, focusing on one semantic object or region at a time, and employing layering techniques. Images courtesy Catherine Kay Greenup and Rawpixel.
Figure 2: How a real artist paints. Time lapse from a real painting video, representative of the training painting style. The artist uses a back-to-front order with layering techniques, starting with the sky, then clouds, mountains, and other elements. The artist typically focuses on one semantic region at a time.
Figure 3: Method overview. The training has two stages. The instruction generation stage (left two gray boxes) includes the text instruction generator (green) and the mask instruction generator (light orange). These generators produce the text and mask instructions essential for updating the canvas in the next stage. The second stage is canvas rendering (third gray box), where a diffusion-based renderer generates the next image based on multiple conditional signals, such as text and mask instructions. Omitting the text and mask (purple boxes) yields the one-stage method described in the "One-Stage Canvas Rendering Approach" section. To simplify the figure, we omit the VAE encoder, CLIP encoder, and text encoder. During testing at step $t-1$ (last gray box), we first generate a text instruction (green arrows), which is then used to create a region mask (orange arrows). Both are then provided to the canvas rendering stage to produce the next image (blue arrows). Image courtesy Catherine Kay Greenup.
Figure 4: Generated text instructions. The sequence of generated text instructions (yellow text and arrows) demonstrates a natural painting order, arranging elements from back to front such as clouds over the sky, flowers over grass, and reflections over water. The "Details" in the right image refers to water texture and small details on the island. Each text instruction may repeat over multiple frames but is displayed only once to simplify this figure. Images courtesy the Art Institute of Chicago and Cleveland Museum of Art.
Figure 5: Effects of conditional signals. (a) shows the current canvas and target image (inset). With only predicted CLIP embeddings of the next image (b), the model generates excessive content per update. Including time intervals (c) properly limits new content volume but results in unnatural mountain rendering. Omitting the mask instruction (d) causes the renderer to complete the mountain area in one step, relying heavily on the text instruction "mountain". Omitting the text instruction (e) results in generating some of the green lake (red arrow) before completing the mountain (mask shown in the inset). The full pipeline (f) updates the canvas at a reasonable pace, drawing the top of the mountain in green, before layering on the yellow region. Image courtesy Catherine Kay Greenup.
...and 6 more figures
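
The test-time procedure described in the Figure 3 caption (text instruction, then region mask, then canvas update, repeated from a blank canvas) can be sketched as a simple autoregressive loop. This is only an illustrative sketch, not the authors' implementation: the three callables stand in for the text instruction generator, the mask instruction generator, and the diffusion-based renderer, and every name here (`generate_painting_process`, `Step`, the `toy_*` functions, `REGIONS`), the flat-list canvas, and the stopping rule are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in type: a "canvas" is a flat list of pixel values.
Canvas = List[float]


@dataclass
class Step:
    text: str          # text instruction, e.g. "sky"
    mask: List[bool]   # region mask selecting where to paint
    canvas: Canvas     # canvas after this update


def generate_painting_process(
    target: Canvas,
    text_generator: Callable[[Canvas, Canvas], str],
    mask_generator: Callable[[Canvas, Canvas, str], List[bool]],
    renderer: Callable[[Canvas, Canvas, str, List[bool]], Canvas],
    max_steps: int = 50,
) -> List[Step]:
    """Autoregressive loop: text instruction -> mask instruction -> canvas update."""
    canvas = [0.0] * len(target)  # initially blank canvas
    steps: List[Step] = []
    for _ in range(max_steps):
        if canvas == target:  # assumed stopping rule, for this sketch only
            break
        text = text_generator(canvas, target)
        mask = mask_generator(canvas, target, text)
        canvas = renderer(canvas, target, text, mask)
        steps.append(Step(text, mask, canvas))
    return steps


# Toy stand-ins (assumptions): each region is identified by a pixel value,
# listed in back-to-front order.
REGIONS = {"sky": 1.0, "mountain": 2.0, "lake": 3.0}


def toy_text(canvas: Canvas, target: Canvas) -> str:
    # Pick the first region, in back-to-front order, that is not yet painted.
    for name, value in REGIONS.items():
        if any(t == value and c != t for c, t in zip(canvas, target)):
            return name
    return "details"


def toy_mask(canvas: Canvas, target: Canvas, text: str) -> List[bool]:
    # Select the pixels that belong to the named region.
    value = REGIONS.get(text)
    return [t == value for t in target]


def toy_render(canvas: Canvas, target: Canvas, text: str, mask: List[bool]) -> Canvas:
    # Copy target pixels inside the mask; leave the rest of the canvas untouched.
    return [t if m else c for c, t, m in zip(canvas, target, mask)]


target = [3.0, 1.0, 2.0, 1.0, 3.0, 2.0]
steps = generate_painting_process(target, toy_text, toy_mask, toy_render)
print([s.text for s in steps])
```

In the toy run, each iteration paints one semantic region, which mirrors the "one region at a time" behavior the captions describe; a real renderer would instead produce a partial, learned update conditioned on the same signals.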
