Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

TL;DR

This work tackles unsupervised physical parameter estimation from videos governed by known continuous equations. It addresses two limitations of prior work: supervised approaches require labels, and unsupervised frame-prediction approaches are typically restricted to motion. It proposes a decoder-free architecture that learns latent dynamics via an encoder and a differentiable physics block, optimized with a two-term loss that includes a KL-divergence regularizer to prevent collapse to trivial solutions. The authors validate the method on synthetic datasets with damped second-order dynamics and demonstrate robustness to initialization, outperforming baselines that rely on frame prediction or masks. They further introduce Delfys75, a real-world dataset of 75 videos with ground-truth parameters across five dynamical systems, and show competitive parameter recovery without object masks. Overall, the method advances unsupervised, decoder-free estimation of physical parameters from video and provides Delfys75 as a new evaluation benchmark.

Abstract

Extracting physical dynamical system parameters from recorded observations is key in natural science. Current methods for automatic parameter estimation from video train supervised deep networks on large datasets. Such datasets require labels, which are difficult to acquire. While some unsupervised techniques--which depend on frame prediction--exist, they suffer from long training times and initialization instabilities, only consider motion-based dynamical systems, and are evaluated mainly on synthetic data. In this work, we propose an unsupervised method to estimate the physical parameters of known, continuous governing equations from single videos; the method is suitable for different dynamical systems beyond motion and is robust to initialization. Moreover, we remove the need for frame prediction by implementing a KL-divergence-based loss function in the latent space, which avoids convergence to trivial solutions and reduces model size and compute. We first evaluate our model on synthetic data, as commonly done. We then take the field closer to reality by recording Delfys75: our own real-world dataset of 75 videos for five different types of dynamical systems, on which we evaluate our method and others. Our method compares favorably to others. Code and data are available online: https://github.com/Alejandro-neuro/Learning_physics_from_video.
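
To make the idea of a latent-space loss more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the first term is a mean squared error between predicted and encoded latents, and that the KL term compares a Gaussian fitted to the encoded latents against a standard normal prior. The function name `latent_loss` and the parameters `kl_weight` and `eps` are hypothetical.

```python
import numpy as np

def latent_loss(z_pred, z_enc, kl_weight=1.0, eps=1e-8):
    """Two-term latent loss (illustrative assumption, not the paper's exact form).

    z_pred: latents predicted by the physics block for frames t+1
    z_enc:  latents produced by the encoder for the same frames
    """
    z_pred = np.asarray(z_pred, dtype=float)
    z_enc = np.asarray(z_enc, dtype=float)

    # Term 1: the prediction should match the encoding.
    fit = np.mean((z_pred - z_enc) ** 2)

    # Term 2: KL( N(mu, var) || N(0, 1) ) for a Gaussian fitted to the
    # encoded latents. It grows when the variance shrinks towards zero,
    # which penalizes the trivial solution of a constant latent.
    mu = z_enc.mean()
    var = z_enc.var() + eps
    kl = 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

    return fit + kl_weight * kl, fit, kl
```

Under these assumptions, a constant encoder output gives a zero prediction error but a large KL term, so the total loss does not reward the collapsed solution.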

Paper Structure

This paper contains 23 sections, 22 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: We propose a novel unsupervised approach to physical parameter estimation from videos. Black squares are video frames with different states of a white pendulum. Starting from a frame at time $t$ (center) an encoder estimates the dynamical states $z_t$. A learnable physics block (Physics ODE) solves the dynamical system equations to predict future states $\hat{z}_{t+1}$ in latent space (blue lines). Previous state-of-the-art methods (left) then use decoders and a reconstruction loss ($\mathcal{L}$, purple left) to train the physics ODE block. In contrast, our method (right) completely avoids the need for a decoder by leveraging a loss function in the latent space ($\mathcal{L}$, blue right). Our loss function minimizes the distance between the estimated states $\hat{z}_{t+1}$ and $z_{t+1}$.
  • Figure 2: Method overview. A video recording of an object with a periodic brightness change (bottom) displays dynamics $z_{\text{real}}$ with sampling period $\delta t$. Each frame is mapped by the encoder $E_\theta$ to the unsupervised latent representation $z_t$. The physics block $P_\gamma$ generates a prediction of the future step $\hat{z}_{t+1}$, which we compare to the encoded representation $z_{t+1}$ of frame $t{+}1$. Top-right: Loss function of our model; the first term ensures the prediction fits with the encoding, while the second expression controls the variance of $z$. This image summarizes our methodology and the relationship between the different blocks.
  • Figure 3: Delfys75 is the first real-world physical parameter estimation dataset. Top to bottom: first ($t_0$), second ($t_1$), middle ($t_i$), and last ($t_n$) frames. The estimated parameters for each scenario are specified at the bottom. Note the complex shadows, shading, and realistic lighting conditions in a natural environment.
  • Figure 4: (a) Latent space estimation of the dynamic variable $z$ for the three synthetic datasets. The blue line shows the 'ground truth' value $z_{\text{real}}$ of the simulated dynamics. The model was trained with the dynamics of the continuous line while the dashed lines show the ground truth (blue) and predictions (yellow, red, green) on the extrapolated test set. (b) Parameter estimation accuracy. Rows 1-3: mean $\pm$ standard deviation of each learnable parameter in the physics block after training, bottom row: ground truth (GT). The values are obtained over 7 different runs with different initializations. We observe good agreement between the predicted and ground truth dynamics.
  • Figure 5: Example frames from the synthetic datasets. Each row shows a different dataset, corresponding to a different continuous dynamical system, and each column a different time sample.
  • ...and 6 more figures
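
The captions of Figures 1 and 2 describe a physics block that predicts the next latent state $\hat{z}_{t+1}$ from the current one. A minimal sketch of such a step for a damped second-order system is given below. It is an illustration under stated assumptions, not the authors' code: the governing equation is taken to be $\ddot{z} = -k z - c \dot{z}$, the velocity is estimated by a finite difference of two consecutive latents, and the integrator is semi-implicit Euler. The names `physics_step`, `k`, `c`, and `substeps` are hypothetical.

```python
def physics_step(z_prev, z_curr, k, c, dt, substeps=10):
    """Predict z_{t+1} from z_{t-1} and z_t for z'' = -k*z - c*z'.

    k, c: stiffness-like and damping-like parameters (the quantities a
          learnable physics block would estimate; names are assumptions)
    dt:   sampling period between frames
    """
    # Finite-difference estimate of the latent velocity (assumption).
    v = (z_curr - z_prev) / dt
    z = z_curr
    h = dt / substeps

    # Semi-implicit Euler: update the velocity first, then the position.
    for _ in range(substeps):
        a = -k * z - c * v
        v = v + h * a
        z = z + h * v
    return z


# Usage: one predicted step for a lightly damped oscillator.
z_next = physics_step(z_prev=0.9, z_curr=1.0, k=4.0, c=0.2, dt=0.05)
```

In a trainable setting, `k` and `c` would be optimized so that the predicted value agrees with the encoder output for frame $t{+}1$; an automatic differentiation framework would be needed for that, which this sketch omits.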