InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya
TL;DR
InterDyn addresses the challenge of predicting continuous interactive dynamics from a single image by using large video diffusion models as implicit physics engines. It extends a frozen Stable Video Diffusion model with a trainable ControlNet-like control branch that encodes the driving motion into the latent space, so that the control signal guides generation across the $T$ denoising timesteps and the resulting dynamics remain temporally coherent. The approach is validated on synthetic CLEVRER scenes and on real-world hand-object interactions (Something-Something-v2), where it demonstrates force propagation, post-interaction dynamics, and counterfactual scenarios, and it outperforms baselines that focus on static state transitions on both image and video metrics. The results underscore the potential of generative video models as implicit physics simulators for interactive dynamics, with no explicit 3D reconstruction or physics simulation, and with broad implications for robotics, planning, and AI-assisted video synthesis.
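The ControlNet-like conditioning described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: `frozen_denoiser`, `ControlBranch`, `zero_gate`, and `run_denoising` are hypothetical names, the "latent" is a plain list of floats rather than a video tensor, and the loop is a stand-in for a real diffusion sampler. The zero-initialized gate is an assumption borrowed from the usual ControlNet design, where the branch starts as a no-op. The sketch shows only the general pattern: a frozen backbone, a trainable branch that sees the control signal, and a residual added at every denoising step.

```python
def frozen_denoiser(latent, t):
    """Stand-in for the frozen video diffusion backbone (toy, hypothetical).

    Its behaviour is fixed; nothing here is ever trained.
    """
    return [0.9 * x for x in latent]


class ControlBranch:
    """Toy ControlNet-like branch (hypothetical).

    Holds the only trainable parameters. The output gate starts at zero,
    so before training the branch adds nothing to the backbone output.
    """

    def __init__(self):
        self.w_control = 1.0  # trainable weight on the control signal
        self.zero_gate = 0.0  # zero-initialized output gate (assumed design)

    def __call__(self, latent, control, t):
        # Mix the noisy latent with the encoded driving motion, then gate it.
        return [self.zero_gate * (x + self.w_control * c)
                for x, c in zip(latent, control)]


def controlled_denoiser(latent, control, t, branch):
    """Frozen backbone output plus the residual from the control branch."""
    base = frozen_denoiser(latent, t)
    residual = branch(latent, control, t)
    return [b + r for b, r in zip(base, residual)]


def run_denoising(latent, control, branch, num_steps):
    """Apply the controlled denoiser at each of `num_steps` timesteps.

    This is a toy update rule, not a real diffusion sampler; it only shows
    that the control signal is injected at every step.
    """
    for t in reversed(range(num_steps)):
        latent = controlled_denoiser(latent, control, t, branch)
    return latent


if __name__ == "__main__":
    branch = ControlBranch()
    noisy_latent = [1.0, -2.0, 0.5]
    driving_motion = [0.0, 1.0, 1.0]  # hypothetical encoded control signal
    print(run_denoising(noisy_latent, driving_motion, branch, num_steps=4))
```

With the gate at zero the controlled model reproduces the frozen backbone exactly. Once training moves the gate away from zero, the driving motion begins to steer the output. This is one common reason for keeping the backbone frozen: the pretrained prior is left intact while only the branch learns the conditioning.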
Abstract
Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but they focus on generating a single future state, which neglects the continuous dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics "simulators", having learned interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines. Project page: https://interdyn.is.tue.mpg.de/
