Loomis Painter: Reconstructing the Painting Process
Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe
TL;DR
Loomis Painter tackles the challenge of turning static reference images into faithful, multi-media painting processes by using a diffusion-based video model conditioned on medium-aware semantics and trained with a reverse-painting objective. The approach enables cross-media transfer, temporal coherence, and high-fidelity final frames, aided by a large, occlusion-cleaned dataset of real painting workflows and a novel Perceptual Distance Profile (PDP) metric that quantifies progression from composition to detail. Key contributions include a medium-aware semantic embedding, cross-media structural alignment, a reverse-painting learning strategy, and a dataset with accompanying tooling for occlusion removal. Quantitative and qualitative results show improvements over state-of-the-art baselines on LPIPS, DINO, CLIP, and FID, and demonstrate realistic, human-aligned painting sequences across media such as acrylic, oil, pencil, and Loomis-style drawings.
Abstract
Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
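The abstract does not define the PDP curve, so the following is only an illustrative sketch of one plausible reading: for each frame of a painting process, measure its distance to the final frame and plot those distances over time. The function names (`mse`, `distance_profile`), the flattened-list frame format, and the toy blending process are all assumptions made for this example. Mean squared error is used purely as a stand-in; a perceptual metric such as LPIPS would take its place in practice.

```python
import random


def mse(a, b):
    """Stand-in distance (NOT a perceptual metric): mean squared error
    between two frames given as flat lists of pixel intensities."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distance_profile(frames, distance=mse):
    """Hypothetical helper: distance from every frame to the final frame.

    Assumption: the profile is taken relative to the last frame of the
    sequence. The source text does not spell out the exact definition.
    """
    final = frames[-1]
    return [distance(frame, final) for frame in frames]


# Toy "painting process": a blank white canvas blended linearly into a target.
random.seed(0)
target = [random.random() for _ in range(64)]
blank = [1.0] * 64
steps = [0.0, 0.25, 0.5, 0.75, 1.0]
frames = [
    [(1 - t) * b + t * p for b, p in zip(blank, target)]
    for t in steps
]

profile = distance_profile(frames)
for t, d in zip(steps, profile):
    print(f"progress={t:.2f}  distance_to_final={d:.4f}")
```

In this toy setting the profile falls steadily to zero as the canvas approaches the target. For a real painting video, the shape of such a curve (how fast the distance drops early versus late) is what would separate coarse composition from later detail refinement under this reading.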
