Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis
Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie
TL;DR
Sketch2Cinemagraph is a sketch-guided diffusion framework for stylized cinemagraph synthesis. It gives users intuitive control over both content and motion through hand-drawn structure and motion sketches combined with text prompts. The pipeline first generates a stylized landscape image and a corresponding realistic reference, then predicts a latent motion field with a diffusion-based model guided by motion sketches and fluid masks, and finally warps frames using Euler integration and symmetric splatting to produce looping cinemagraphs. Key contributions include a two-stage landscape image generation approach with ControlNet conditioning, a Latent Motion Diffusion Model (LMDM) for sketch-guided flow prediction, and a diffusion-based cinemagraph synthesis step. The method is evaluated against strong baselines and shows quantitative improvements in motion-field fidelity and cinemagraph realism. The framework makes cinemagraph creation more accessible to non-experts, offering precise, sketch-driven control and high-quality stylized outputs suitable for artistic and practical applications.
Abstract
Designing stylized cinemagraphs is challenging because complex and expressive flow motions are difficult to customize. For intuitive and detailed control of the generated cinemagraphs, freehand sketches convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate the target stylized landscape images along with their realistic versions. Then, a pre-trained object detection model is used to segment the flow regions and obtain their masks. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches, together with the text prompt, serve as conditions that control the generated vector fields in the masked fluid regions. To synthesize the cinemagraph frames, the pixels within the fluid regions are then warped to their target locations at each timestep using a frame generator. The results show that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against state-of-the-art generation approaches.
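As a rough illustration of the warping step, the sketch below shows how a single static motion field might be accumulated into per-frame displacement fields by Euler integration, the integration scheme named in the summary above. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `euler_integrate`, the `(H, W, 2)` array layout, and the nearest-neighbour lookup are all hypothetical choices made for brevity, and the symmetric splatting and frame generator are omitted.

```python
import numpy as np

def euler_integrate(motion, num_steps):
    """Accumulate a static motion field into displacement fields.

    motion: (H, W, 2) array; motion[y, x] = (dx, dy) in pixels per frame.
            (Layout is an assumption made for this sketch.)
    Returns a list of (H, W, 2) arrays D_1..D_num_steps, where D_t[y, x]
    is the offset after t steps of the pixel that started at (x, y).
    """
    h, w, _ = motion.shape
    ys, xs = np.mgrid[0:h, 0:w]
    disp = np.zeros_like(motion, dtype=np.float64)
    fields = []
    for _ in range(num_steps):
        # Current position of every pixel, rounded and clamped so the
        # motion field can be sampled by nearest-neighbour lookup.
        px = np.clip(np.rint(xs + disp[..., 0]), 0, w - 1).astype(int)
        py = np.clip(np.rint(ys + disp[..., 1]), 0, h - 1).astype(int)
        # Euler step: add the motion found at the current position.
        disp = disp + motion[py, px]
        fields.append(disp.copy())
    return fields

# Uniform rightward flow of one pixel per frame on a tiny 4x4 grid.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
fields = euler_integrate(flow, num_steps=3)
```

In a setting like the one described here, the motion field would be zero outside the fluid mask, so pixels in static regions keep a zero displacement and only the masked fluid regions move.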
