Integrating Multimodal Data for Joint Generative Modeling of Complex Dynamics

Manuel Brenner, Florian Hess, Georgia Koppe, Daniel Durstewitz

TL;DR

The paper tackles the reconstruction of complex dynamical systems from multimodal, often non-Gaussian time series by introducing Multimodal Teacher Forcing (MTF), a framework that couples a multimodal variational autoencoder (MVAE) with a dendritic piecewise linear RNN (dendPLRNN) via shared decoders. MTF uses the MVAE to generate a sparse, data-informed teacher signal that guides training of the DSR model, yielding fully generative latent dynamics that preserve geometry and long-term behavior. Across synthetic chaotic systems (Lorenz-63, Rössler, Lewis-Glass) and real neural data (fMRI+behavior, hippocampal spike trains with position), MTF outperforms competing strategies (SVAE, BPTT-based, and multiple shooting) and enables DS reconstruction from ordinal and symbolic data while handling missing modalities. The framework’s modularity and demonstrated success in cross-modal inference and symbolic dynamics suggest broad applicability to scientific domains where multimodal measurements are available but difficult to model jointly.

Abstract

Many, if not most, systems of interest in science are naturally described as nonlinear dynamical systems. Empirically, we commonly access these systems through time series measurements. Often such time series may consist of discrete random variables rather than continuous measurements, or may be composed of measurements from multiple data modalities observed simultaneously. For instance, in neuroscience we may have behavioral labels in addition to spike counts and continuous physiological recordings. While by now there is a burgeoning literature on deep learning for dynamical systems reconstruction (DSR), multimodal data integration has hardly been considered in this context. Here we provide an efficient and flexible algorithmic framework for this purpose that rests on a multimodal variational autoencoder for generating a sparse teacher signal that guides training of a reconstruction model, exploiting recent advances in DSR training techniques. It makes it possible to combine various sources of information for optimal reconstruction, even allows for reconstruction from symbolic data (class labels) alone, and connects different types of observations within a common latent dynamics space. In contrast to previous multimodal data integration techniques for scientific applications, our framework is fully generative, producing, after training, trajectories with the same geometrical and temporal structure as those of the ground truth system.
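
The sparse teacher signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the latent map below is a plain piecewise-linear recurrence chosen for illustration (not the dendPLRNN itself), the teacher signal is random numbers standing in for the MVAE encoder output, and the names `latent_step`, `rollout_sparse_tf`, and `tau` (the forcing interval) are hypothetical.

```python
import numpy as np


def latent_step(z, A, W, h):
    """One step of an assumed piecewise-linear latent map:
    z_next = A * z + W @ relu(z) + h, with A holding diagonal entries."""
    return A * z + W @ np.maximum(z, 0.0) + h


def rollout_sparse_tf(z0, teacher, tau, A, W, h):
    """Roll the latent model forward over len(teacher) steps, replacing the
    state by the teacher signal every `tau` steps (sparse teacher forcing).
    Between forced steps the model runs freely on its own predictions."""
    T, M = teacher.shape
    traj = np.empty((T, M))
    z = z0
    for t in range(T):
        if t % tau == 0:
            z = teacher[t]  # forced step: overwrite state with teacher signal
        traj[t] = z
        z = latent_step(z, A, W, h)
    return traj


# Illustrative setup; all values are arbitrary assumptions.
rng = np.random.default_rng(0)
M, T, tau = 3, 25, 10
A = np.full(M, 0.9)                       # diagonal linear part
W = 0.1 * rng.standard_normal((M, M))     # off-diagonal coupling
np.fill_diagonal(W, 0.0)
h = np.zeros(M)
teacher = rng.standard_normal((T, M))     # stand-in for encoder output

traj = rollout_sparse_tf(teacher[0], teacher, tau, A, W, h)
```

In a training loop, `traj` would be passed through the modality-specific decoders and compared with the observations; forcing only every `tau` steps lets the model evolve freely in between while keeping trajectories from diverging too far from the data-inferred states.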

Paper Structure (49 sections, 38 equations, 19 figures, 6 tables)


Figures (19)

  • Figure 1: MTF setup. Multimodal observations are translated via an encoder into a common latent representation, which is used for sparse TF in the DSR model's latent space. The latent trajectory is then mapped back into observation space via modality-specific decoder models, which are shared between the MVAE and DSR model.
  • Figure 2: DS reconstruction from moderately (a) and heavily (b-c) distorted continuous observations (Gaussian observation noise of $10 \%$ and $50 \%$, respectively, of the data variance) and other simultaneously provided observation modalities, sampled from a Lorenz-63 system. a: Freely generated example trajectories from a dendPLRNN ($M=20, B=10, K=20, \tau=10$) trained with MTF jointly on Gaussian ($10 \%$ noise), ordinal, and count data. b: Same as a for a dendPLRNN trained by MTF on heavily distorted Gaussian ($50 \%$ noise) and ordinal observations. Note that even in this case the butterfly wing structure of the Lorenz attractor was successfully captured. The maximum Lyapunov exponent ($\lambda_{\text{max}}$) furthermore confirms the dendPLRNN-generated attractor is chaotic (for the GT Lorenz system, $\lambda_{\text{max}} \approx 0.903$). c: Normalized cumulative histograms of geometrical attractor disagreement ($D_{stsp}$, left) and Hellinger distance ($D_{H}$, right) between reconstructed and ground-truth system for the same setting as in b.
  • Figure 3: Cross-modal inference with missing observations, using a mixture-of-experts encoder. a: Reconstruction of the Lorenz-63 system from Gaussian and ordinal observations, with $20\%$ of data removed at random times chosen independently for each modality. b: Using only the Gaussian expert, the corresponding ordinal observations can be decoded almost perfectly, including at times missing in the ordinal training data. Dashed lines indicate sections with missing data in a and b.
  • Figure 4: DS reconstruction from discrete observations by MTF ($M=30, B=15, K=30, \tau=10$). Top: Reconstruction of Rössler attractor from only ordinal time series. Bottom: DS reconstruction from symbolic coding of Lorenz attractor (see Fig. \ref{fig:symbolic_lorenz_categories} for true and predicted class label probabilities). Ground truth systems and their ordinal/symbolic encoding are on the left, corresponding reconstructions on the right. Note that in both cases the topology and general geometry are preserved, and maximal Lyapunov exponents closely match those of the true systems (Rössler: $\lambda_{\text{max}}^{\text{true}} \approx 0.072$, Lorenz: $\lambda_{\text{max}}^{\text{true}} \approx 0.903$). TDE = temporal delay embedding.
  • Figure 5: a: Example reconstructions of spike trains and spatial location of animal. Top: Spike train data from a test set not used for model training (topmost), and model-generated spike trains (below) simulated from a data-inferred initial condition. Bottom: True and model-predicted spatial position on the test set. b, top: Correlation of various spike train statistics between test set and model-generated data (blue), and -- for comparison -- between experimental training and test set data (orange): mean spike rate, zero count ratio, coefficient of variation, and correlation between cross-correlation coefficients. Diagonal gray lines are bisectrices, not regression lines. Bottom: Cross-correlation matrices among all neurons for the experimental test set (left) and model-generated spike data (right). c: Joint DSR from both spatial and neural data significantly improves reconstructions compared to just neural data alone ($* \, p<0.05$, $*** \, p<0.001$). d: DSR model latent space (shown is a subspace), illustrating how the latent dynamics is organized according to the animal's spatial position (color-coded).
  • ...and 14 more figures
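
The Figure 2 caption reports a Hellinger distance $D_{H}$ between reconstructed and ground-truth systems. As a rough illustration of the quantity itself, the sketch below computes the standard Hellinger distance between two discrete distributions. Which distributions the paper compares is not stated in this excerpt, so the inputs here are arbitrary examples, and the function name `hellinger` is hypothetical.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two non-negative vectors, each normalized
    to sum to one. Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


# Arbitrary example inputs (assumptions, not data from the paper).
d_same = hellinger([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
d_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
d_close = hellinger([0.2, 0.3, 0.5], [0.25, 0.3, 0.45])
```

Because the distance is bounded between 0 and 1, values from different systems and settings can be compared on a common scale.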