Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories

Yan Zhang, Sergey Prokudin, Marko Mihajlovic, Qianli Ma, Siyu Tang

TL;DR

This work introduces DOMA, a compact implicit motion model that learns spatiotemporal affine motion fields for generic 3D scenes by feeding time as an additional input to a SIREN-based network. DOMA extends the prior Dynamic Point Field (DPF), which models the deformation between the canonical frame and each observation frame separately, to continuous multi-frame dynamics. It achieves temporal smoothness through wave-equation-inspired regularization and gains representational power from additional output degrees of freedom (DOFs) in the affine part. The key theoretical insight is a Jacobian-based analysis showing that extra DOFs in the output layer increase local motion complexity without expanding the hidden layers. This permits a much smaller yet expressive model: $16d + nd^2$ parameters versus $(6d+nd^2)(T-1)$ for frame-wise modeling, where $d$ is the hidden-layer dimension, $n$ the number of hidden layers, and $T$ the sequence length. Empirically, DOMA variants, especially DOMA-Affinity, show superior novel point motion prediction and temporal mesh alignment on challenging datasets such as DeformingThings4D and Resynth, with much smaller model sizes (around 200 KB vs. 8 MB) and improved temporal regularity. These results support the approach and its potential for robust, prior-free motion modeling in dynamic scenes.
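To make the parameter counts quoted in the TL;DR concrete, the following sketch evaluates both expressions. It assumes that $d$ is the hidden dimension, $n$ the number of hidden layers, and $T$ the number of frames, that only weights are counted (the quoted expressions contain no bias terms), and that each weight is stored as a 32-bit float. The values `d = 128`, `n = 3`, `T = 41` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def doma_weight_count(d: int, n: int) -> int:
    """Weights of one shared spatiotemporal network: 16*d + n*d^2."""
    return 16 * d + n * d * d


def framewise_weight_count(d: int, n: int, T: int) -> int:
    """Weights of T-1 independent per-frame networks: (6*d + n*d^2) * (T-1)."""
    return (6 * d + n * d * d) * (T - 1)


def size_kb(num_weights: int, bytes_per_weight: int = 4) -> float:
    """Storage in kilobytes, assuming float32 weights (an assumption)."""
    return num_weights * bytes_per_weight / 1000.0


# Hypothetical settings, for illustration only.
d, n, T = 128, 3, 41

shared = doma_weight_count(d, n)
per_frame = framewise_weight_count(d, n, T)

print(f"shared model:     {shared} weights, {size_kb(shared):.1f} KB")
print(f"frame-wise model: {per_frame} weights, {size_kb(per_frame):.1f} KB")
```

Under these illustrative settings the shared model needs 51,200 weights (about 205 KB) and the frame-wise model 1,996,800 weights (about 8 MB), the same order of magnitude as the sizes quoted above. The shared count does not depend on $T$, whereas the frame-wise count grows linearly with the sequence length.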

Abstract

Understanding the dynamics of generic 3D scenes is fundamentally challenging in computer vision, essential in enhancing applications related to scene reconstruction, motion tracking, and avatar creation. In this work, we address the task as the problem of inferring dense, long-range motion of 3D points. By observing a set of point trajectories, we aim to learn an implicit motion field parameterized by a neural network to predict the movement of novel points within the same domain, without relying on any data-driven or scene-specific priors. To achieve this, our approach builds upon the recently introduced dynamic point field model that learns smooth deformation fields between the canonical frame and individual observation frames. However, temporal consistency between consecutive frames is neglected, and the number of required parameters increases linearly with the sequence length due to per-frame modeling. To address these shortcomings, we exploit the intrinsic regularization provided by SIREN, and modify the input layer to produce a spatiotemporally smooth motion field. Additionally, we analyze the motion field Jacobian matrix, and discover that the motion degrees of freedom (DOFs) in an infinitesimal area around a point and the network hidden variables have different behaviors to affect the model's representational power. This enables us to improve the model representation capability while retaining the model compactness. Furthermore, to reduce the risk of overfitting, we introduce a regularization term based on the assumption of piece-wise motion smoothness. Our experiments assess the model's performance in predicting unseen point trajectories and its application in temporal mesh alignment with guidance. The results demonstrate its superiority and effectiveness. The code and data for the project are publicly available: \url{https://yz-cnsdqz.github.io/eigenmotion/DOMA/}

Paper Structure (42 sections, 1 theorem, 25 equations, 9 figures, 15 tables)

This paper contains 42 sections, 1 theorem, 25 equations, 9 figures, 15 tables.

Key Result

Theorem 1

Provided and in which $\circ$ is the composition of element-wise multiplication and broadcasting a vector to a matrix, as well as $i \in \{0,1,\dots,n-1\}$, the bound of the spectral norm of $\nabla {\bm u}$ is given by in which $n$ and $d$ denote the number of hidden layers and the dimension of hidden layers, respectively.

Figures (9)

  • Figure 1: We introduce DOMA, a compact implicit motion model designed to capture generic dynamics of 3D scenes. By processing a 3D point $\bm x$ in the canonical frame alongside a 1D time step $t$, DOMA predicts an affine mapping, parameterized by a linear map ${\bm A}_\theta$ and a translation vector ${\bm u}_\theta$. By leveraging the inherent regularity of the utilized SIREN framework \cite{sitzmann2019siren}, DOMA ensures the generation of a spatiotemporally smooth motion field. The model’s capacity to represent complex dynamics can be controlled by adjusting the degrees of freedom of the output affine mapping.
  • Figure 2: Qualitative results on the Synthetic sequences. The smoothness regularization is applied. Rows show types of motions, and columns show the testing points in the canonical frame, a target frame, and estimated results, respectively. See more descriptions in Sec. \ref{sec:exp1} and Sec. \ref{sec:supp:more_on_synthetic} in supp. mat.
  • Figure 3: Illustration of results on two Resynth sequences. From left to right: The source mesh in the canonical frame, two consecutive frames of the target scans, the results from frame-wise DPF and DOMA-Affinity, respectively. Both AIAP \cite{prokudin2023dynamic} and smoothness regularization are applied. The bounding boxes highlight significant changes.
  • Figure A1: Illustration of the DOMA model architecture. The SIREN layers \cite{sitzmann2020implicit} produce an affine transformation, which maps the point from ${\bm x}$ to ${\bm y}$ at time $t$.
  • Figure A2: Illustrations of results on the Synthetic sequences. The smoothness regularization is applied. Rows show types of motions, and columns show the testing points in the canonical frame, a target frame, and estimated results from different methods, respectively.
  • ...and 4 more figures
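
The captions of Figure 1 and Figure A1 describe a network that takes a canonical point ${\bm x}$ and a time step $t$ and outputs an affine map, so that ${\bm y} = {\bm A}_\theta {\bm x} + {\bm u}_\theta$. The NumPy sketch below illustrates that data flow only. Several details are assumptions rather than the authors' implementation: the function names `init_siren` and `doma_forward`, the layer sizes, the frequency `w0 = 30`, the standard SIREN initialization, and in particular the choice to predict ${\bm A}$ as the identity plus a 3x3 residual.

```python
import numpy as np


def init_siren(rng, d=64, n=2, d_in=4, d_out=12, w0=30.0):
    """Random weights for a SIREN-style MLP: d_in -> d -> (n x d) -> d_out.

    d_in = 4 is (x, y, z, t); d_out = 12 is a 3x3 linear map plus a
    3D translation. All sizes here are hypothetical.
    """
    dims = [d_in] + [d] * (n + 1) + [d_out]
    params = []
    for k, (fan_in, fan_out) in enumerate(zip(dims[:-1], dims[1:])):
        # Standard SIREN initialization: wider range for the first layer.
        bound = 1.0 / fan_in if k == 0 else np.sqrt(6.0 / fan_in) / w0
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params


def doma_forward(params, x, t, w0=30.0):
    """Map canonical points x of shape (N, 3) to time t with a per-point affine map."""
    t_col = np.full((x.shape[0], 1), float(t))
    h = np.concatenate([x, t_col], axis=1)        # (N, 4) space-time input
    for W, b in params[:-1]:
        h = np.sin(w0 * (h @ W + b))              # sine activations
    W, b = params[-1]
    out = h @ W + b                               # (N, 12), linear output layer
    # Assumption: A is predicted as identity plus a residual.
    A = np.eye(3) + out[:, :9].reshape(-1, 3, 3)  # (N, 3, 3) linear part
    u = out[:, 9:]                                # (N, 3) translation part
    y = np.einsum("nij,nj->ni", A, x) + u         # y = A x + u for every point
    return y, A, u


rng = np.random.default_rng(0)
params = init_siren(rng, d=64, n=2)
x = rng.normal(size=(5, 3))
y, A, u = doma_forward(params, x, t=0.5)
print(y.shape, A.shape, u.shape)
```

With this layout the weight count is $4d + nd^2 + 12d = 16d + nd^2$, which matches the expression quoted in the TL;DR. Restricting the output to the three translation components would reduce the output DOFs while leaving the hidden layers unchanged.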

Theorems & Definitions (2)

  • Theorem 1
  • proof