
MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation

Byungjun Kim, Soobin Um, Jong Chul Ye

Abstract

Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g., "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity, a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.

Paper Structure (20 sections, 10 equations, 8 figures, 11 tables, 1 algorithm)

Figures (8)

  • Figure 1: Showcase of MotionCFG. While Standard CFG (left) produces over-saturated colors and static outputs that often fail to reflect the intended actions, MotionCFG (right) resolves motion ambiguity by selectively sharpening motion-related embeddings, thereby yielding realistic and physically dynamic videos faithful to the prompt. Motion words are highlighted in green.
  • Figure 2: Overview of the MotionCFG pipeline. Step 1: Motion-related tokens in the prompt are identified via an LLM. Step 2: Gaussian noise ($\delta_t$) is injected exclusively into these motion text embeddings to generate a perturbed condition ($c_{pert,t}$). Step 3: A piecewise guidance schedule applies MotionCFG during the early sampling steps to establish robust motion trajectories, before reverting to standard CFG to refine spatial details.
  • Figure 3: Qualitative comparison on Wan2.1. Each row shows uniformly sampled frames from a generated video. "Baseline" denotes negative-prompted CFG. While existing approaches produce near-static outputs or visual artifacts, MotionCFG generates physically plausible dynamics faithful to the highlighted motion tokens (green).
  • Figure 4: Trade-off analysis between text fidelity and motion dynamics. Each point corresponds to a different hyperparameter choice (e.g., ratio $\tau$). MotionCFG (red) achieves higher motion scores (FLOW, Dino Segm Dist, DEVIL) with minimal X-CLIP degradation, while baselines either sacrifice fidelity or fail to improve dynamics.
  • Figure 5: Qualitative comparison of subject count consistency. While Interval Guidance (a) hallucinates an additional horse, our method (b) consistently maintains the requested count of three horses throughout the video.
  • ...and 3 more figures
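
Steps 2 and 3 of the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the constant noise scale `sigma` (the caption's $\delta_t$ may vary with the timestep), the ratio `tau` used as the switch point, and the extrapolation form `eps_neg + scale * (eps_cond - eps_neg)` (the usual CFG form, with the perturbed condition assumed to take the place of the negative branch) are all assumptions made for illustration. Step 1 (LLM-based identification of motion tokens) is assumed to have already produced `motion_idx`.

```python
import random


def perturb_motion_embeddings(embeddings, motion_idx, sigma, rng):
    """Copy `embeddings` and add Gaussian noise only to the motion-token rows.

    `embeddings` is a list of per-token vectors (lists of floats);
    `motion_idx` lists the token positions assumed to carry motion words.
    """
    perturbed = [row[:] for row in embeddings]  # leave the input untouched
    for i in motion_idx:
        perturbed[i] = [x + rng.gauss(0.0, sigma) for x in perturbed[i]]
    return perturbed


def negative_branch(step, num_steps, tau):
    """Piecewise schedule (assumed form): use the perturbed condition as the
    negative for the first `tau` fraction of steps, standard CFG afterwards."""
    return "perturbed" if step < tau * num_steps else "standard"


def guided_prediction(eps_cond, eps_neg, scale):
    """CFG-style extrapolation away from the negative-branch prediction."""
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


# Toy usage: three tokens with 2-d embeddings; token 1 is the motion word.
rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c_pert = perturb_motion_embeddings(tokens, motion_idx=[1], sigma=0.5, rng=rng)
schedule = [negative_branch(s, num_steps=10, tau=0.5) for s in range(10)]
```

In a real sampler, `eps_neg` would be the denoiser output under `c_pert` while the schedule returns `"perturbed"`, and the output under the usual unconditional or negative prompt once it returns `"standard"`.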