Integrating Random Effects in Variational Autoencoders for Dimensionality Reduction of Correlated Data

Giora Simchoni, Saharon Rosset

TL;DR

VAEs assume IID observations, which limits performance on correlated datasets. LMMVAE addresses this by splitting the latent space into a fixed part $\mathbf{U}$ and a correlated random part $\mathbf{B}$, with a design matrix $\mathbf{Z}$ producing $\mathbf{Z}\mathbf{B}$ in the generative model, yielding $\mathbf{X} \approx f(\mathbf{U}) + \mathbf{Z}\mathbf{B} + \mathcal{E}$ and a modified ELBO that includes two KL terms. The framework generalizes to high-cardinality categorical data, longitudinal measurements, and spatial locations via appropriate covariance structures (e.g., matrix-normal, Phi, and kernel $\mathbf{K}$) and BLUP-style updates. Across extensive simulations and real datasets, LMMVAE achieves lower reconstruction error and NLL on unseen data and yields more informative latent representations for downstream tasks, outperforming several state-of-the-art alternatives. This work enables scalable, principled handling of structured correlation in large tabular and image datasets, enhancing representation learning and predictive performance in practical settings.

Abstract

Variational Autoencoders (VAE) are widely used for dimensionality reduction of large-scale tabular and image datasets, under the assumption of independence between data observations. In practice, however, datasets are often correlated, with typical sources of correlation including spatial, temporal and clustering structures. Inspired by the literature on linear mixed models (LMM), we propose LMMVAE -- a novel model which separates the classic VAE latent model into fixed and random parts. While the fixed part assumes the latent variables are independent as usual, the random part consists of latent variables which are correlated between similar clusters in the data such as nearby locations or successive measurements. The classic VAE architecture and loss are modified accordingly. LMMVAE is shown to improve squared reconstruction error and negative likelihood loss significantly on unseen data, with simulated as well as real datasets from various applications and correlation scenarios. It also shows improvement in the performance of downstream tasks such as supervised classification on the learned representations.
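
As a rough illustration of how the loss could change once the latent model is split into fixed and random parts, the sketch below writes an ELBO with two KL terms. It assumes, purely for illustration, that the variational posterior factorizes as $q(\mathbf{U}, \mathbf{B} \mid \mathbf{X}) = q(\mathbf{U} \mid \mathbf{X})\, q(\mathbf{B} \mid \mathbf{X})$ and that the prior factorizes as $p(\mathbf{U})\, p(\mathbf{B})$. The exact objective used by LMMVAE (for example, any weighting of the KL terms or the covariance-specific form of $p(\mathbf{B})$) may differ.

$$
\mathcal{L}(\mathbf{X}) = \mathbb{E}_{q(\mathbf{U}, \mathbf{B} \mid \mathbf{X})}\left[\log p(\mathbf{X} \mid \mathbf{U}, \mathbf{B})\right] - \mathrm{KL}\left(q(\mathbf{U} \mid \mathbf{X}) \,\|\, p(\mathbf{U})\right) - \mathrm{KL}\left(q(\mathbf{B} \mid \mathbf{X}) \,\|\, p(\mathbf{B})\right)
$$

Under these assumptions the first term plays the role of the reconstruction loss, with the decoder mean given by $f(\mathbf{U}) + \mathbf{Z}\mathbf{B}$, and the correlation structure enters only through the prior $p(\mathbf{B})$ in the second KL term.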

Paper Structure

This paper contains 26 sections, 13 equations, 5 figures, 30 tables.

Figures (5)

  • Figure 1: LMMVAE architecture: data $\mathbf{X}$ enters two separate FE and RE encoders (or a single encoder with a double output), which produce the fixed LV $\mathbf{u}$ and the RE $\mathbf{b}$ via the reparameterization trick. $\mathbf{u}$ goes through the FE decoder to give $f(\mathbf{u})$. $\mathbf{Z}$ enters the model and multiplies the RE matrix $\mathbf{B}$, which is first formed from the $\mathbf{b}$ RE vectors as depicted by the $\odot$ symbol (see the different covariance scenarios for more details), and the resulting RE term $\mathbf{Z}\mathbf{B}$ is added to $f(\mathbf{u})$. This produces the final reconstructions $\mathbf{\hat{X}}$.
  • Figure 2: Predicted vs. true scatter plots for simulated datasets with $n = 100000$ observations. First row: first column of $\mathbf{B}_1, \mathbf{B}_2, \mathbf{B}_3$. Second row: LV $\mathbf{U}$ (here $d = 2$). Third row: first 3 columns of $\mathbf{X}_{te}$. A: Three high-cardinality categorical features, with $q_1 = 1000, q_2 = 3000, q_3 = 5000$. B: Longitudinal data with $q = 1000$ subjects and $K = 3$ polynomial terms on $t$, Random mode. C: Spatial data with $q = 10000$ locations and an RBF kernel.
  • Figure 3: Exploring the $\mathbf{\hat{B}}$ RE matrix from the Cars dataset containing spatial features, with a raster plot. Left: distribution across the US of the $\mathbf{\hat{B}}$ column corresponding to the price feature; Right: distribution across the US of the $\mathbf{\hat{B}}$ column corresponding to the odometer feature.
  • Figure 4: Comparing true vs. reconstructed $\mathbf{X}_{te}$ for the Rossmann stores longitudinal dataset, Future mode with $d = 1$. In this mode the model is trained on the first 25 months of the dataset to reconstruct the last 6 months. Left: comparing LMMVAE and VAE with entity embeddings on the sales feature; Right: comparing LMMVAE and VAE with entity embeddings on the school days feature.
  • Figure 5: Comparing true vs. reconstructed $\mathbf{X}_{te}$ for the CelebA dataset of facial images with $d = 100$, using convolutional neural networks for both FE and RE encoders. Left: true faces; Right: reconstructed faces.
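
The data flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoders are single linear maps, the decoder is a one-hidden-layer network, there is a single categorical feature with `q` clusters, and the RE matrix $\mathbf{B}$ is formed by averaging the per-observation $\mathbf{b}$ vectors within each cluster. All function and parameter names (`one_hot_design`, `lmmvae_forward`, `W_mu_u`, and so on) are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def one_hot_design(cluster_ids, q):
    """Build the n x q design matrix Z from integer cluster labels in [0, q)."""
    Z = np.zeros((cluster_ids.size, q))
    Z[np.arange(cluster_ids.size), cluster_ids] = 1.0
    return Z

def init_params(p, d, hidden, rng):
    """Random weights for the toy linear encoders and the small decoder."""
    s = 0.1
    return {
        "W_mu_u": s * rng.standard_normal((p, d)),
        "W_lv_u": s * rng.standard_normal((p, d)),
        "W_mu_b": s * rng.standard_normal((p, p)),
        "W_lv_b": s * rng.standard_normal((p, p)),
        "W_dec1": s * rng.standard_normal((d, hidden)),
        "W_dec2": s * rng.standard_normal((hidden, p)),
    }

def lmmvae_forward(X, cluster_ids, q, params, rng):
    # FE encoder: mean and log-variance of the fixed latent u (n x d).
    u = reparameterize(X @ params["W_mu_u"], X @ params["W_lv_u"], rng)
    # RE encoder: mean and log-variance of the per-observation RE b (n x p).
    b = reparameterize(X @ params["W_mu_b"], X @ params["W_lv_b"], rng)
    # Form B (q x p) from the b vectors. ASSUMPTION: per-cluster average.
    Z = one_hot_design(cluster_ids, q)
    counts = np.maximum(Z.sum(axis=0), 1.0)[:, None]  # avoid division by zero
    B = (Z.T @ b) / counts
    # FE decoder output f(u), then add the RE term ZB.
    f_u = np.tanh(u @ params["W_dec1"]) @ params["W_dec2"]
    X_hat = f_u + Z @ B
    return X_hat, u, B

# Toy usage: n = 6 observations, p = 4 features, d = 2, q = 3 clusters.
rng = np.random.default_rng(0)
n, p, d, q = 6, 4, 2, 3
X = rng.standard_normal((n, p))
cluster_ids = np.array([0, 0, 1, 1, 2, 2])
params = init_params(p, d, hidden=8, rng=rng)
X_hat, u, B = lmmvae_forward(X, cluster_ids, q, params, rng)
print(X_hat.shape, u.shape, B.shape)
```

In this sketch, observations in the same cluster share one row of $\mathbf{B}$, so their reconstructions differ only through $f(\mathbf{u})$. The longitudinal and spatial scenarios would need a different $\mathbf{Z}$ and a different rule for forming $\mathbf{B}$, which this toy example does not cover.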