InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya

TL;DR

InterDyn addresses the challenge of predicting continuous interactive dynamics from a single image by leveraging large video diffusion models as implicit physics engines. It extends a frozen Stable Video Diffusion with a trainable ControlNet-like control branch that encodes a driving motion into the latent space, enabling temporally coherent dynamics guided by the control signal through $T$ denoising timesteps. The approach is validated on synthetic CLEVRER scenes and real-world hand-object interactions (Something-Something-v2), demonstrating force propagation, post-interaction dynamics, and counterfactual scenarios, and it outperforms baselines that focus on static state transitions in both image and video metrics. The results underscore the potential of using generative video models as implicit physics simulators for interactive dynamics, without explicit 3D reconstruction or explicit physics simulation, with broad implications for robotics, planning, and AI-assisted video synthesis.
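The setup described above (a frozen backbone, a trainable control branch, and iterative denoising that starts from Gaussian noise) can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: `frozen_backbone`, `ControlBranch`, `denoise`, the update rule, and the zero-initialised output scale (borrowed from the original ControlNet design) are all assumptions made for illustration, and the control residual is added to the backbone output rather than to intermediate features as a real ControlNet-style branch would do.

```python
import random


def frozen_backbone(latents, image_cond):
    """Hypothetical stand-in for the frozen video model.

    Toy rule: the predicted "noise" for each frame is its offset from the
    conditioning image, so denoising alone pulls every frame to that image.
    """
    return [[z - c for z, c in zip(frame, image_cond)] for frame in latents]


class ControlBranch:
    """Hypothetical trainable branch that encodes the driving motion.

    The output scale starts at zero (an assumption taken from ControlNet),
    so an untrained branch leaves the backbone prediction unchanged.
    """

    def __init__(self):
        self.out_scale = 0.0  # the only "trainable" parameter in this toy

    def __call__(self, control):
        return [[self.out_scale * m for m in frame] for frame in control]


def denoise(image_cond, control, branch, num_steps=10, seed=0):
    """Iteratively refine per-frame latents, starting from Gaussian noise."""
    rng = random.Random(seed)
    # One latent vector per control frame, sampled from N(0, 1).
    latents = [[rng.gauss(0.0, 1.0) for _ in image_cond] for _ in control]
    residual = branch(control)  # control signal is fixed across steps
    for _ in range(num_steps):
        eps = frozen_backbone(latents, image_cond)
        # Toy update: move each latent half-way against (noise + control).
        latents = [
            [z - 0.5 * (e + r) for z, e, r in zip(fz, fe, fr)]
            for fz, fe, fr in zip(latents, eps, residual)
        ]
    return latents


if __name__ == "__main__":
    image = [1.0, -2.0, 0.5]                    # toy "first frame" features
    masks = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # toy two-frame driving motion
    branch = ControlBranch()
    print(denoise(image, masks, branch, num_steps=50))  # untrained branch
    branch.out_scale = 1.0                      # pretend training changed it
    print(denoise(image, masks, branch, num_steps=50))
```

With the scale at zero, every generated frame collapses to the conditioning image, which mirrors why a zero-initialised control branch is a safe starting point for fine-tuning: the frozen model's behaviour is preserved until the branch learns something. Once the scale is non-zero, the frames diverge according to the driving signal.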

Abstract

Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state, which neglects the continuous dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics "simulators", having learned interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines. Project page: https://interdyn.is.tue.mpg.de/

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: State transition vs. dynamics. Methods that generate static state transitions (i.e., predict a future image), such as CosHand [sudhakar2024coshand], struggle to capture the inherent dynamic processes involved in human-object interactions. Here, we show a video sequence where the motion continues beyond the interaction.
  • Figure 2: Overview of InterDyn. Given an input image depicting a scene, such as a hand holding a remote, and a "driving motion," such as a sequence of binary hand masks, InterDyn generates a video depicting plausible hand and object dynamics. Crucially, InterDyn receives no control signal for the object. Through this setup, we probe and assess the implicit knowledge of large video generation models on complex interactive dynamics. We use Stable Video Diffusion (SVD) as our frozen backbone and fine-tune a separate control signal encoder. Videos are iteratively denoised over $T$ timesteps, starting from Gaussian noise $\boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$.
  • Figure 3: Qualitative investigation on the CLEVRER dataset. Given an input image and the "driving" motion of one or two objects, our model predicts the future interactive dynamics of multiple elements in the scene. The driving motion is given in the form of semantic mask sequences. The generated object motions are highlighted with a red-line trajectory. Note that our model can generate videos with force propagation across multiple uncontrolled objects (top) and can generate multiple futures (bottom). Zoom in for details.
  • Figure 4: Qualitative comparison. A two-state approach such as CosHand [sudhakar2024coshand] struggles with post-interaction object dynamics.
  • Figure 5: Robustness to noise. SAM2 outputs noisy/coarse masks for frames with considerable motion blur (orange/red). Despite this, InterDyn can generate plausible hand and object dynamics.
  • ...and 4 more figures