Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

TL;DR

The paper addresses the realism gap in human video generation by disentangling foreground and background motion into pose-guided foreground and sparse-tracking background representations. It leverages a latent diffusion framework with dual encoders and a masking strategy, enabling coherent foreground actions alongside authentic background dynamics. To produce longer videos without error accumulation, it introduces clip-by-clip generation with condition concatenation and global feature injection, reinforced by a Temporal Motion Block for temporal coherence. Evaluations on TikTok and a newly collected Human-5000 dataset demonstrate superior foreground-background harmony and sequence stability compared to prior methods, with ablations validating the contribution of each component. The approach advances realistic video synthesis and offers a scalable path toward extended, coherent human-centric videos in real-world settings.

Abstract

Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.
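The clip-by-clip strategy described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `denoise_clip` is a hypothetical stand-in for the diffusion sampler (it is not the paper's network), the "link" between the last frame and the input noise is assumed to be a channel-wise concatenation, and the global feature is assumed to be the reference latent itself. The sketch only shows the control flow: each clip is conditioned on the final frame of the previous clip, and the same reference-image feature is injected at every step.

```python
import numpy as np


def denoise_clip(cond_and_noise: np.ndarray, global_feat: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the diffusion sampler (NOT the paper's model).

    cond_and_noise: (T, 2*C, H, W), condition latent and noise stacked on the channel axis.
    global_feat:    (C, H, W), feature of the initial reference image.
    Returns a clip of shape (T, C, H, W).
    """
    c = cond_and_noise.shape[1] // 2
    cond, noise = cond_and_noise[:, :c], cond_and_noise[:, c:]
    # Toy mixing rule so the example runs; a real sampler would iterate denoising steps.
    return 0.9 * cond + 0.05 * noise + 0.05 * global_feat


def generate_long_video(ref_latent: np.ndarray, num_clips: int, clip_len: int, seed: int = 0) -> np.ndarray:
    """Generate num_clips clips in sequence, each conditioned on the previous clip's last frame."""
    rng = np.random.default_rng(seed)
    global_feat = ref_latent   # assumption: reference latent doubles as the global feature
    last_frame = ref_latent    # the first clip starts from the reference image
    clips = []
    for _ in range(num_clips):
        noise = rng.standard_normal((clip_len, *ref_latent.shape)).astype(np.float32)
        cond = np.broadcast_to(last_frame, noise.shape)
        x = np.concatenate([cond, noise], axis=1)  # assumed channel-wise concatenation
        clip = denoise_clip(x, global_feat)        # same global feature injected for every clip
        clips.append(clip)
        last_frame = clip[-1]                      # carry continuity into the next clip
    return np.concatenate(clips, axis=0)


video = generate_long_video(np.zeros((4, 8, 8), dtype=np.float32), num_clips=3, clip_len=16)
print(video.shape)
```

Because `global_feat` is computed once from the initial reference image and reused for every clip, later clips are anchored to the original appearance rather than to a chain of generated frames, which is the mechanism the abstract credits with limiting cumulative color drift.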

Paper Structure (19 sections, 2 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Our proposed method features a pipeline that meticulously models foreground motions with pose guidance and captures background dynamics via a sparse tracking system. To tackle the challenge of video synthesis beyond training sample lengths, we use a sequential strategy, generating video clips that build on the last frame of the previous clip to maintain continuity. A global feature vector derived from the initial image ensures visual coherence across the iterative process, preventing color or context inconsistencies. Additionally, a Temporal Motion Block ensures smooth transitions and temporal consistency between frames, enhancing realism and fluidity in the generated sequences.
  • Figure 2: We separately extract body poses and tracking points from the reference video, with body poses encapsulating the movements of the foreground subjects and tracking points signifying the dynamics of the background. In order to address any potential overlap between these two distinct representations, we employ the inverse of the foreground mask, which is multiplied with the extracted tracking points. This operation effectively removes any overlap, ensuring a clean separation of foreground and background motion elements.
  • Figure 3: Inputting previously unseen data, our method's qualitative performance highlights its success in generating videos that seamlessly integrate foreground and background motion.
  • Figure 4: On the TikTok dataset, qualitative analysis reveals that MagicAnimate (xu2023magicanimate) outputs suffer from color distortion, while AnimateAnyone (hu2023animate) outputs also show inconsistencies in details such as faces and text. Our method notably improves on both aspects.
  • Figure 5: Compared with AnimateAnyone (hu2023animate) on the Human-5000 dataset, which features background motion, our method generates noticeably better background movement.
  • ...and 1 more figure
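
The masking step described in the Figure 2 caption, where the tracking points are multiplied by the inverse of the foreground mask, can be sketched as follows. This is a minimal illustration under stated assumptions: the tracking points are assumed to be rasterized into a per-frame map of shape (T, H, W), the foreground mask is assumed to be binary with 1 on the human region, and the function name `mask_background_tracks` is hypothetical.

```python
import numpy as np


def mask_background_tracks(track_map: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Remove tracking-point responses that fall on the foreground.

    track_map: (T, H, W) array, nonzero where a sparse tracking point is drawn (assumed layout).
    fg_mask:   (T, H, W) array with values in {0, 1}, 1 on the human (foreground) region.
    """
    if track_map.shape != fg_mask.shape:
        raise ValueError("track_map and fg_mask must have the same shape")
    inverse_mask = 1.0 - fg_mask.astype(np.float32)      # 1 on background, 0 on foreground
    return track_map.astype(np.float32) * inverse_mask   # element-wise product


# Toy example: one frame, one tracking point on the person and one on the background.
track_map = np.zeros((1, 4, 4), dtype=np.float32)
track_map[0, 1, 1] = 1.0   # lands inside the foreground mask
track_map[0, 3, 0] = 1.0   # lands on the background
fg_mask = np.zeros((1, 4, 4), dtype=np.float32)
fg_mask[0, 0:3, 1:3] = 1.0

bg_tracks = mask_background_tracks(track_map, fg_mask)
```

After masking, only the background point survives, so the pose signal and the tracking signal no longer describe the same pixels.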