
LoopAnimate: Loopable Salient Object Animation

Fanyi Wang, Peng Liu, Haotian Hu, Dan Meng, Jingwen Su, Jinjin Xu, Yanhao Zhang, Xiaoming Ren, Zhiwang Zhang

TL;DR

LoopAnimate addresses the challenge of long, loopable video generation with high object fidelity by decoupling multi-level image appearance from textual semantics and introducing a three-stage training strategy. It combines a pretrained image-to-image diffusion model, an Asymmetric Loop Sampling Strategy (ALSS), and a Temporal Enhanced Motion Module (TEMM) within a Multi-level Image representation and Textual semantics Decoupling Framework (MITDF) to extend single-pass generation to 35 frames. Key contributions include ALSS data preparation, latent-space conditioning with inverted noise and FGSM masks, and selective cross-attention injections that optimize motion in the middle network blocks. Experiments show state-of-the-art fidelity and temporal consistency, and subjective studies confirm improved loopability, making LoopAnimate practical for loopable content such as animated wallpapers and long-form clips.

Abstract

Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical application. Additionally, specific domains such as animated wallpapers require seamless looping, in which the first and last frames of the video match. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training in order to encode temporal and positional information at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the number of frames and reduces the number of fine-tuned modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. LoopAnimate is the first method to extend the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.

Paper Structure

This paper contains 21 sections, 15 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Generation results of our LoopAnimate with given reference images and text prompts.
  • Figure 2: Illustration of LoopAnimate. Arrows for different training stages are drawn in different colors.
  • Figure 3: Visual comparison with state-of-the-art open-source and commercial methods.
  • Figure 4: Visualization results for different injection positions of the image and text embeddings. Down/Mid/Up denote the down-sample/middle/up-sample blocks, respectively. IE and TE denote the image embedding and text embedding, respectively.