Loomis Painter: Reconstructing the Painting Process

Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe

TL;DR

Loomis Painter tackles the challenge of turning static reference images into faithful, multi-media painting processes by using a diffusion-based video model conditioned on medium-aware semantics and trained with a reverse-painting objective. The approach enables cross-media transfer, temporal coherence, and high-fidelity final frames, aided by a large, occlusion-cleaned dataset of real painting workflows and a novel PDP metric that quantifies progression from composition to detail. Key contributions include medium-aware semantic embedding, cross-media structural alignment, a reverse-painting learning strategy, and a dataset with pipeline tooling for occlusion removal. Quantitative and qualitative results show improvements over state-of-the-art baselines on LPIPS, DINO, CLIP, and FID, and demonstrate realistic, human-aligned painting sequences across media such as acrylic, oil, pencil, and Loomis-style drawings.

Abstract

Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
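
To make the idea of a distance profile over a painting sequence concrete, the sketch below computes one distance value per frame against the final frame. This is an illustration under stated assumptions, not the paper's definition of PDP: the excerpt does not give the formula, so the choice of the final frame as the comparison target is assumed, the function name `distance_profile` is hypothetical, and mean squared error stands in for a perceptual metric such as LPIPS.

```python
import numpy as np


def distance_profile(frames, distance=None):
    """Return one distance per frame, measured against the last frame.

    frames:   array-like of shape (T, H, W, C), ordered from blank canvas
              to finished painting.
    distance: optional callable (frame, final) -> float. Defaults to mean
              squared error, an ASSUMED stand-in for a perceptual metric.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if distance is None:
        def distance(a, b):
            return float(np.mean((a - b) ** 2))
    final = frames[-1]
    return np.array([distance(frame, final) for frame in frames])


# Toy sequence: a white canvas blended linearly toward a fixed "painting".
rng = np.random.default_rng(0)
painting = rng.random((8, 8, 3))
canvas = np.ones_like(painting)
steps = np.linspace(0.0, 1.0, 5)
toy_frames = np.stack([(1 - t) * canvas + t * painting for t in steps])

profile = distance_profile(toy_frames)
```

For this toy sequence the profile decreases steadily and reaches zero at the last frame. On real painting processes, the shape of such a curve would be what distinguishes coarse early stages from late detail refinement.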

Paper Structure

This paper contains 25 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Loomis Painter: Our method reconstructs the painting process of any input image, either faithfully, as shown in (b), or in different art media, as in (a) and (c). The title of our work is inspired by the Loomis portrait method, which we also enable. Images with green borders are input reference images; all others are generated by our method.
  • Figure 2: Overview of our painting process generation method. The curated video is first reversed to better align with the underlying video generation model. We LoRA-tune WAN 2.1 [wan2025wan], a video generation model conditioned on an input image and a prompt. In our case, the input image corresponds to a painting, and the model learns to reconstruct the steps to paint it, starting from a finished painting to a blank canvas. In (b), the media transfer model is shown, which enables the video generation model to render any input image as an acrylic, oil, or pencil painting based on the text input. To achieve this, we generate variations of the reference image using image editing models and train the video generation model to reconstruct the original painting process.
  • Figure 3: We visualize the noisy video latent and the image latent prior to channel-wise concatenation. Empty boxes indicate the padded frames with zeros. Notably, the natural painting order (a) exhibits poor temporal alignment with the video latents.
  • Figure 4: Dataset Curation Pipeline Overview. Our framework extracts painting workflows from raw tutorial videos. First, start and end frames are detected, and the painting canvas is localized. The video is then partitioned into $N$ segments, from which $M$ frames are sampled per segment. Subsequently, occlusions (e.g., hands, brushes) are detected in the sampled frames, and a masked median is computed over the sampled frames, using the preceding median frame as a reference to remove transient obstructions. Finally, logos and text overlays are detected and removed (not shown in the figure), producing $N$ occlusion-free frames. Image courtesy of Samir Godinjak, from the Painting with Samir YouTube channel.
  • Figure 5: Comparison using the same input image (bottom right). Columns 1–4 show samples from the art media transfer model; column 5 shows the base model output. As the base model’s final frame closely matches the input, only the input image is shown. For the base model, we employed the standard prompt. The last row shows the final frame of each method.
  • ...and 7 more figures
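
The masked-median step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `masked_median` is hypothetical, occlusion masks are assumed to be given as boolean arrays, and the role of the preceding median frame is assumed to be a fallback for pixels that are occluded in every sampled frame (the caption does not spell out how the reference is used).

```python
import numpy as np


def masked_median(frames, occlusion_masks, reference):
    """Per-pixel median over sampled frames, ignoring occluded pixels.

    frames:          (M, H, W, C) sampled frames from one video segment.
    occlusion_masks: (M, H, W) booleans, True where a hand or brush
                     covers the canvas.
    reference:       (H, W, C) median frame of the preceding segment,
                     used here (ASSUMPTION) only where a pixel is
                     occluded in all M frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(occlusion_masks, dtype=bool)
    reference = np.asarray(reference, dtype=np.float64)

    # Mark occluded pixels as NaN so that nanmedian skips them.
    masked = np.where(masks[..., None], np.nan, frames)

    # Pixels hidden in every frame have no valid sample; fall back to
    # the reference frame at those locations.
    always_hidden = masks.all(axis=0)
    masked[:, always_hidden] = reference[always_hidden]

    return np.nanmedian(masked, axis=0)


# Toy segment: 3 frames of a 2x2 single-channel canvas with value 0.5.
toy_frames = np.full((3, 2, 2, 1), 0.5)
toy_masks = np.zeros((3, 2, 2), dtype=bool)

toy_frames[0, 0, 0] = 1.0   # a hand covers pixel (0, 0) in frame 0 only
toy_masks[0, 0, 0] = True

toy_frames[:, 1, 1] = 1.0   # pixel (1, 1) is covered in all frames
toy_masks[:, 1, 1] = True

toy_reference = np.full((2, 2, 1), 0.2)
clean = masked_median(toy_frames, toy_masks, toy_reference)
```

In the toy example, pixel (0, 0) recovers the canvas value from the two unoccluded frames, while pixel (1, 1) takes its value from the reference frame. A median is a natural choice here because transient obstructions that survive mask detection in a minority of frames are still suppressed.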