Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis

Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie

TL;DR

Sketch2Cinemagraph presents a novel sketch-guided diffusion framework for stylized cinemagraph synthesis, enabling intuitive control over both content and motion from hand-drawn structure and motion sketches in conjunction with text prompts. The pipeline first generates stylized landscape images and corresponding realistic references, then predicts a latent motion field with a diffusion-based model guided by motion sketches and fluid masks, and finally warps frames using Euler integration and symmetric splatting to produce looping cinemagraphs. Key contributions include a two-stage landscape image generation approach with ControlNet conditioning, a Latent Motion Diffusion Model (LMDM) for sketch-guided flow prediction, and a diffusion-based cinemagraph synthesis step, all evaluated against strong baselines with quantitative improvements in motion-field fidelity and cinemagraph realism. The framework enables broader accessibility to cinemagraph creation by non-experts, offering precise, sketch-driven control and high-quality stylized outputs suitable for artistic and practical applications.
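The Euler integration and looping steps mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `euler_integrate` and `loop_blend` are hypothetical, the nearest-neighbour flow lookup is a simplification chosen for brevity, and the linear cross-fade is only an assumed stand-in for the blending weights used in symmetric splatting. It assumes a static per-frame flow field of shape (H, W, 2) holding (dx, dy) in pixels.

```python
import numpy as np


def euler_integrate(flow, num_steps):
    """Accumulate displacement fields D_0..D_num_steps from a static flow field.

    flow[y, x] holds the per-frame motion (dx, dy) in pixels at that location.
    Recurrence: D_t(p) = D_{t-1}(p) + flow(p + D_{t-1}(p)), with D_0 = 0.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    disp = np.zeros(flow.shape, dtype=np.float64)
    fields = [disp.copy()]
    for _ in range(num_steps):
        # Nearest-neighbour lookup of the flow at each pixel's current position
        # (an assumption for brevity; bilinear sampling is a common alternative).
        px = np.clip(np.rint(xs + disp[..., 0]), 0, w - 1).astype(int)
        py = np.clip(np.rint(ys + disp[..., 1]), 0, h - 1).astype(int)
        disp = disp + flow[py, px]
        fields.append(disp.copy())
    return fields


def loop_blend(forward_frame, backward_frame, t, num_frames):
    """Linearly cross-fade a forward-warped and a backward-warped frame.

    At t = 0 the result equals the forward frame; at t = num_frames it equals
    the backward frame, so the sequence can close into a loop.
    """
    alpha = t / num_frames
    return (1.0 - alpha) * forward_frame + alpha * backward_frame
```

For a uniform rightward flow of one pixel per frame, `euler_integrate(flow, 3)[3]` is a displacement of three pixels everywhere, which is the behaviour one would expect from repeatedly following the field.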

Abstract

Designing stylized cinemagraphs is challenging because complex and expressive flow motions are difficult to customize. For intuitive and detailed control of the generated cinemagraphs, freehand sketches convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate target stylized landscape images along with their realistic versions. Then, a pre-trained object detection model is used to segment the flow regions and obtain their masks. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches, together with the prompt, serve as conditions that control the generated vector fields in the masked fluid regions. To synthesize the cinemagraph frames, the pixels within the fluid regions are then warped to their target locations at each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against state-of-the-art generation approaches.
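As a rough illustration of warping only the pixels inside a fluid mask, the sketch below moves each fluid pixel to its displaced location for one timestep. It is a simplified stand-in for the paper's frame generator, not a reproduction of it: the function name `warp_fluid_pixels` is hypothetical, targets are rounded to the nearest pixel, pixels that would leave the fluid region are dropped, and locations that receive no pixel simply keep their original value.

```python
import numpy as np


def warp_fluid_pixels(image, displacement, fluid_mask):
    """Forward-warp pixels inside fluid_mask by displacement (nearest pixel).

    image:        (H, W) or (H, W, C) array.
    displacement: (H, W, 2) array of (dx, dy) offsets in pixels.
    fluid_mask:   (H, W) boolean array, True where the scene is fluid.
    """
    h, w = fluid_mask.shape
    out = image.copy()
    ys, xs = np.nonzero(fluid_mask)
    # Target coordinates of every fluid pixel, clamped to the image bounds.
    tx = np.clip(np.rint(xs + displacement[ys, xs, 0]), 0, w - 1).astype(int)
    ty = np.clip(np.rint(ys + displacement[ys, xs, 1]), 0, h - 1).astype(int)
    # Keep static regions untouched: discard targets outside the fluid mask.
    keep = fluid_mask[ty, tx]
    out[ty[keep], tx[keep]] = image[ys[keep], xs[keep]]
    return out
```

A learned frame generator would typically fill the holes and resolve the collisions that this nearest-pixel version ignores; the sketch only shows how the mask confines motion to the fluid regions.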

Paper Structure

This paper contains 22 sections, 8 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Given the input sketches with motion sketches (gradient greyscale lines) and text prompt, Sketch2Cinemagraph can automatically synthesize stylized cinemagraphs with the generated motion fields. The red font represents the text prompt for flow generation and the blue font for style generation. The gradient greyscale lines depict the flow motion direction. (The generated stylized cinemagraphs are embedded and better viewed using Adobe Reader.)
  • Figure 2: The workflow of Sketch2Cinemagraph. Given input hand-drawn sketches of landscape structure and motions, the proposed framework generates landscape cinemagraphs in three steps: (a) stylized landscape image generation, (b) sketch-guided motion field prediction, and (c) stylized cinemagraph synthesis.
  • Figure 3: Paired landscape image generation using a fine-tuned Stable Diffusion (SD) model. The red text indicates landscape elements, and the blue text represents the provided style. (a) structural sketches; (b) landscape image generated by SD model; (c) photo-realistic landscape image generated by fine-tuned SD model; (d) landscape image in specific style generated by SD model.
  • Figure 4: Two-stage fluid mask extraction results (c) from landscape images (a) using the bounding box (b) as intermediate detection results obtained from the Grounded SAM model.
  • Figure 5: (a) ControlNet for motion sketches encoding; (b) New cross-attention layers for image features. The output $F_{out}$ is fused from text and image embeddings.
  • ...and 11 more figures