DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data

Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho

Abstract

Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data during training, represented as optical flow and rendered with computer graphics pipelines. This approach offers two key advantages. First, synthetic motion provides diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and a motion-guided video generator then produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving the visual realism of real-world videos. We validate our framework on two challenging scenarios where existing datasets are particularly limited: vigorous human motion generation and extreme camera motion control. Extensive experiments demonstrate that DynaVid improves realism and controllability in both dynamic motion generation and camera motion control.
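To make the two-stage formulation concrete, the following is a minimal inference-time sketch of the decoupled pipeline, assuming a single conditioning image and a text prompt. The class and argument names (TwoStagePipeline, motion_generator, video_generator, plucker) are illustrative assumptions, not the paper's API; the abstract only states that motion is synthesized first and that video frames are then generated conditioned on it.

```python
import torch

class TwoStagePipeline:
    """Hypothetical sketch of DynaVid's decoupled generation flow (names assumed)."""

    def __init__(self, motion_generator, video_generator):
        # Stage-1 model: learns motion patterns, e.g. from synthetic CG-rendered optical flow.
        self.motion_generator = motion_generator
        # Stage-2 model: trained on real videos, conditioned on motion (and optionally camera).
        self.video_generator = video_generator

    @torch.no_grad()
    def __call__(self, first_frame, prompt, plucker=None):
        # Stage 1: synthesize motion as dense optical flow, e.g. a (T, 2, H, W) tensor.
        flow = self.motion_generator(first_frame, prompt, camera=plucker)

        # Stage 2: generate video frames, e.g. (T, 3, H, W), guided by the generated flow.
        frames = self.video_generator(first_frame, prompt, flow=flow, camera=plucker)
        return frames
```

Because appearance is modeled only in the second stage, the artificial look of rendered data never reaches the frame generator; only the motion signal does.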

Paper Structure

This paper contains 33 sections, 3 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Examples of video synthesis results for highly dynamic object motion (top) and camera-controlled video generation with rapid viewpoint changes (bottom). Our method produces natural and highly dynamic motions, whereas Wan2.2-5B [wan2025wan] generates unrealistic motion and GEN3C [ren2025gen3c] exhibits noticeable visual artifacts. The synthesis results are best viewed in the supplementary video.
  • Figure 2: Overview of our dataset generation pipeline.
  • Figure 3: Overview of DynaVid. (a) The motion generator first synthesizes motion, and the motion-guided video generator then produces video frames conditioned on it. For camera-controlled video synthesis, Plücker embeddings are provided as additional input (a minimal sketch of this parameterization follows after this figure list). (b) Our framework adopts VACE [jiang2025vace] to incorporate control signals such as Plücker embeddings or optical flow maps.
  • Figure 4: Qualitative comparison of dynamic object motion generation. CogVideoX [yang2024cogvideox] and Wan2.2-5B [wan2025wan] often produce distorted or unrealistic human motions with visual artifacts. HyperMotion [xu2025hypermotion] produces unnatural appearances because it relies on the first frame as input. In contrast, our method generates realistic videos with natural and highly dynamic motions.
  • Figure 5: Qualitative comparison of camera-controlled video synthesis. Red arrows indicate the directions of camera motion. AC3D [bahmani2025ac3d] fails to follow the extreme 180° rotation, while GEN3C [ren2025gen3c] produces noticeable artifacts in regions unseen from the initial view (zoomed-in red box). In contrast, our method produces natural-looking videos that faithfully follow the input camera trajectory. For fair comparison, the same camera parameters are used for all methods.
  • ...and 3 more figures
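Figure 3 conditions the video generator on Plücker embeddings of the camera rays. As a point of reference, below is a minimal NumPy sketch of the standard per-pixel Plücker ray parameterization (ray direction plus moment, six channels per pixel); the function name and argument conventions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def plucker_embedding(K, c2w, H, W):
    """Per-pixel Plücker ray map of shape (H, W, 6) for a single camera.

    K   : (3, 3) pinhole intrinsics.
    c2w : (4, 4) camera-to-world extrinsic matrix.
    """
    # Ray through each pixel center, expressed in camera coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T

    # Rotate to world coordinates and normalize the directions.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment m = o x d, with o the camera center in world coordinates.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moments = np.cross(origin, dirs)

    # Six channels per pixel: (direction, moment).
    return np.concatenate([dirs, moments], axis=-1)
```

Stacking one such map per frame along the time axis gives a (T, H, W, 6) conditioning tensor that can be supplied alongside optical flow for camera-controlled synthesis.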