Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment

Sharmita Dey, Sarath Ravindran Nair

TL;DR

This work tackles cross-modal biomechanical time-series generation by framing $X$ (e.g., joint angles) and $Y$ (e.g., ground reaction forces, GRFs) as observations of a shared dynamical process. It introduces a mutually aligned diffusion framework that trains the conditional diffusion models $p_\theta(\boldsymbol{X}|\boldsymbol{Y})$ and $p_\phi(\boldsymbol{Y}|\boldsymbol{X})$ and enforces local latent manifold alignment (LLMA) through first-order sequence-contrastive and second-order covariance constraints on latent windows. The LLMA objective, combined with standard denoising and energy conservation terms, improves cross-modal generation fidelity and yields more informative latent representations, as shown by lower MSE, lower FID, better predictive scores, and stronger downstream classification. The approach supports cross-modal inference when sensors are missing or noisy, with potential applications in wearable robotics, rehabilitation, and other time-series domains with shared underlying dynamics. It also offers a dynamical-systems perspective for aligning modalities across diffusion steps.

Abstract

We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.
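The first-order and second-order alignment terms described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes that the first-order term is an InfoNCE-style contrastive loss over time-matched latent windows and that the second-order term is a squared Frobenius distance between latent covariance matrices. The function names (`info_nce`, `covariance_gap`, `llma_loss`), the `temperature` and `lam` parameters, and the `(N, D)` input layout (N latent windows, D latent dimensions) are all hypothetical choices made for illustration.

```python
import numpy as np

def info_nce(z_x, z_y, temperature=0.1):
    """Assumed first-order term: window i of X should match window i of Y."""
    # L2-normalise so that dot products are cosine similarities.
    z_x = z_x / np.linalg.norm(z_x, axis=1, keepdims=True)
    z_y = z_y / np.linalg.norm(z_y, axis=1, keepdims=True)
    logits = z_x @ z_y.T / temperature                    # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Time-matched pairs sit on the diagonal; maximise their log-probability.
    return float(-np.mean(np.diag(log_prob)))

def covariance_gap(z_x, z_y):
    """Assumed second-order term: squared Frobenius gap between covariances."""
    c_x = np.cov(z_x, rowvar=False)   # (D, D) covariance over windows
    c_y = np.cov(z_y, rowvar=False)
    return float(np.sum((c_x - c_y) ** 2))

def llma_loss(z_x, z_y, lam=1.0):
    """Hypothetical combined alignment loss; lam weights the second-order term."""
    return info_nce(z_x, z_y) + lam * covariance_gap(z_x, z_y)

# Toy check: time-matched latents should score lower than mismatched ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
shifted = z[np.roll(np.arange(32), 1)]   # break the window correspondence
print(llma_loss(z, z) < llma_loss(z, shifted))
```

In a training loop, a term of this kind would be added to the denoising and energy conservation objectives of both models at each diffusion step; the relative weighting is not specified in this summary.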

Paper Structure

This paper contains 39 sections, 15 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: (Left) Mutually aligned cross-modal diffusion with latent manifold alignment. The diffusion processes $p_\theta(\mathbf{X}|\mathbf{Y})$ and $p_\phi(\mathbf{Y}|\mathbf{X})$ generate data for modalities $\mathbf{X}$ and $\mathbf{Y}$, respectively, each guided by a condition derived from the other modality. During training, the latent representations $h_X(\mathbf{X}_t, t)$ and $h_Y(\mathbf{Y}_t, t)$ of the two models are aligned using a local latent manifold alignment (LLMA) objective. Additionally, denoising and energy conservation objectives are applied to each modality's generated samples, $\hat{\mathbf{X}}$ and $\hat{\mathbf{Y}}$. During sampling, the model for each modality diffuses a noise signal across $T$ timesteps, guided by a condition from the other modality, to generate samples that temporally correspond to the guiding signal.
  • Figure 2: Comparison of real and generated trajectories using models trained with and without latent alignment of diffusion models. Latent alignment improves the quality of generated samples. The shaded region represents the standard deviation.
  • Figure 3: (Top) Visualization of the latent embeddings of the models $p(X|Y)$ and $p(Y|X)$ on data from a held-out subject, for models trained without and with latent alignment. The samples are color-coded by locomotion task label. Without alignment, the latent representations show strong modality-specific separation in latent space. With alignment, the latent spaces of the two modalities merge, and samples belonging to the same task occupy overlapping subspaces. (Bottom left) The correlation between the latent representations of the two modality-specific models on held-out test data. Models trained with alignment show a high correlation between the latent spaces of the two modalities. (Bottom right) Performance of a linear classifier in discriminating the locomotion tasks from the latent representations of the modality-specific models. Latent representations from models trained with alignment give better accuracy, indicating a clearer separation of tasks in their latent spaces.
  • Figure 4: Real (black) and sampled (red) trajectories of joint angles (top) and joint moments (bottom) generated by latent-aligned cross-modal diffusion models. All the generated trajectories follow the ground-truth trajectories closely. The shaded region represents the standard deviation.
  • Figure 5: Example failure cases of the model for the prediction of the three modalities. Failure cases mostly occur when the underlying true signal shows high variability, or due to sign changes in the sampled signals.
  • ...and 1 more figure
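
The latent correlation reported in Figure 3 (bottom left) could be measured in several ways, and the exact metric is not given in this summary. As one plausible sketch, the snippet below computes the mean per-dimension Pearson correlation between time-matched latents of the two models. The function name `mean_latent_correlation` and the `(N, D)` input layout are assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np

def mean_latent_correlation(h_x, h_y):
    """Mean Pearson correlation across latent dimensions (hypothetical metric).

    h_x, h_y: arrays of shape (N, D) holding time-matched latents of the two
    models. Assumes no latent dimension is constant, otherwise the
    denominator would be zero.
    """
    h_x = h_x - h_x.mean(axis=0)
    h_y = h_y - h_y.mean(axis=0)
    num = (h_x * h_y).sum(axis=0)
    den = np.sqrt((h_x ** 2).sum(axis=0) * (h_y ** 2).sum(axis=0))
    return float(np.mean(num / den))

# Toy usage: a noisy copy of the latents correlates far more strongly
# than an independent draw.
rng = np.random.default_rng(0)
h = rng.normal(size=(200, 16))
aligned = h + 0.1 * rng.normal(size=h.shape)
unrelated = rng.normal(size=h.shape)
print(mean_latent_correlation(h, aligned) > mean_latent_correlation(h, unrelated))
```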