Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment
Sharmita Dey, Sarath Ravindran Nair
TL;DR
This work tackles cross-modal biomechanical time-series generation by framing X (e.g., joint angles) and Y (e.g., ground reaction forces, GRFs) as observations of a shared dynamical process. It introduces a mutually aligned diffusion framework that trains conditional diffusion models $p_{\theta}(\boldsymbol{X}|\boldsymbol{Y})$ and $p_{\boldsymbol{\theta}}(\boldsymbol{Y}|\boldsymbol{X})$ and enforces local latent manifold alignment (LLMA) through first-order sequence-contrastive and second-order covariance constraints on latent windows. The LLMA objective, combined with standard denoising and energy conservation terms, improves cross-modal generation fidelity and yields more informative latent representations, as shown by lower MSE, lower FID, better predictive scores, and stronger downstream classification. The approach enables robust cross-modal inference under missing or noisy sensor conditions and has potential applications in wearable robotics, rehabilitation, and broader time-series domains with shared underlying dynamics. It also provides a principled dynamical-systems perspective for aligning modalities across diffusion steps.
Abstract
We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.
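To make the first-order and second-order alignment terms concrete, the following is a minimal NumPy sketch of what a local latent alignment loss of this kind could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names (`window_latents`, `first_order_loss`, `second_order_loss`, `llma_loss`), the window length and stride, the temperature, the weights `lam1` and `lam2`, the use of mean pooling with an InfoNCE-style contrastive term, and the Frobenius distance between window covariances are all hypothetical choices. The denoising and energy conservation terms are omitted.

```python
import numpy as np


def window_latents(z, win, stride):
    """Split a latent sequence of shape (T, D) into windows of shape (N, win, D)."""
    starts = range(0, z.shape[0] - win + 1, stride)
    return np.stack([z[s:s + win] for s in starts])


def first_order_loss(wx, wy, temperature=0.1):
    """Assumed first-order term: symmetric InfoNCE over mean-pooled windows.

    The X and Y windows covering the same time span are treated as positives;
    every other window in the sequence acts as a negative.
    """
    ax = wx.mean(axis=1)
    ay = wy.mean(axis=1)
    ax = ax / (np.linalg.norm(ax, axis=1, keepdims=True) + 1e-8)
    ay = ay / (np.linalg.norm(ay, axis=1, keepdims=True) + 1e-8)
    logits = ax @ ay.T / temperature  # (N, N) cosine similarities

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def second_order_loss(wx, wy):
    """Assumed second-order term: squared Frobenius distance between the
    per-window latent covariance matrices, averaged over windows."""
    cx = wx - wx.mean(axis=1, keepdims=True)
    cy = wy - wy.mean(axis=1, keepdims=True)
    n = wx.shape[1] - 1
    cov_x = np.einsum("ntd,nte->nde", cx, cx) / n
    cov_y = np.einsum("ntd,nte->nde", cy, cy) / n
    return np.mean(np.sum((cov_x - cov_y) ** 2, axis=(1, 2)))


def llma_loss(zx, zy, win=8, stride=4, lam1=1.0, lam2=1.0):
    """Weighted sum of the two alignment terms on time-aligned latent windows."""
    wx = window_latents(zx, win, stride)
    wy = window_latents(zy, win, stride)
    return lam1 * first_order_loss(wx, wy) + lam2 * second_order_loss(wx, wy)


# Usage: latents of the two modalities at one diffusion step (random stand-ins).
rng = np.random.default_rng(0)
z_x = rng.normal(size=(20, 3))                 # e.g., joint-angle latents
z_y = z_x + 0.05 * rng.normal(size=(20, 3))    # e.g., GRF latents
loss = llma_loss(z_x, z_y)
```

In a training loop, a term like this would be evaluated on the latent representations of both conditional models at the same diffusion step and added to the denoising loss, so that windows covering the same phase of the movement are pulled together in both their mean direction and their covariance structure.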
