Probabilistic Digital Twins of Users: Latent Representation Learning with Statistically Validated Semantics

Daniel David

TL;DR

This work introduces a probabilistic digital twin framework where each user is represented by a latent stochastic state that generates observed behavior, enabling principled uncertainty quantification. The model uses amortized variational inference within a VAE, and a statistically validated interpretation pipeline links latent dimensions to observable patterns, revealing mostly continuous latent structure with a few dominant axes (notably Dimension 33). The approach provides uncertainty-aware, interpretable user representations and highlights the importance of data design for identifiability, suggesting richer measurements could uncover discrete user types. Overall, probabilistic digital twins offer a robust alternative to deterministic embeddings for understanding and interpreting user behavior.

Abstract

Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. Most existing approaches rely on deterministic embeddings or black-box predictive models, offering limited uncertainty quantification and little insight into what latent representations encode. We propose a probabilistic digital twin framework in which each user is modeled as a latent stochastic state that generates observed behavioral data. The digital twin is learned via amortized variational inference, enabling scalable posterior estimation while retaining a fully probabilistic interpretation. We instantiate this framework using a variational autoencoder (VAE) applied to a user-response dataset designed to capture stable aspects of user identity. Beyond standard reconstruction-based evaluation, we introduce a statistically grounded interpretation pipeline that links latent dimensions to observable behavioral patterns. By analyzing users at the extremes of each latent dimension and validating differences using nonparametric hypothesis tests and effect sizes, we demonstrate that specific dimensions correspond to interpretable traits such as opinion strength and decisiveness. Empirically, we find that user structure is predominantly continuous rather than discretely clustered, with weak but meaningful structure emerging along a small number of dominant latent axes. These results suggest that probabilistic digital twins can provide interpretable, uncertainty-aware representations that go beyond deterministic user embeddings.
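
The extreme-group validation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the specific test or effect size, so the Mann-Whitney U test (normal approximation, no continuity correction) and the rank-biserial correlation are assumptions, as are the 10% tail fraction, the function names, and the synthetic "behavior" score standing in for a measure such as response extremity.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(high, low):
    """Two-sided Mann-Whitney U test via the tie-corrected normal approximation.

    Returns (U for `high`, p-value, rank-biserial effect size in [-1, 1]).
    """
    pooled = list(high) + list(low)
    n1, n2, n = len(high), len(low), len(pooled)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

    if var == 0:
        p_value = 1.0  # all observations identical: no evidence of a shift
    else:
        z = (u1 - n1 * n2 / 2) / math.sqrt(var)
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail

    effect = 2 * u1 / (n1 * n2) - 1  # rank-biserial correlation
    return u1, p_value, effect


def compare_extremes(latent, behavior, frac=0.1):
    """Compare a behavior score between users at the two tails of one latent dimension."""
    k = max(1, int(len(latent) * frac))
    order = sorted(range(len(latent)), key=lambda i: latent[i])
    low = [behavior[i] for i in order[:k]]
    high = [behavior[i] for i in order[-k:]]
    return mann_whitney_u(high, low)


if __name__ == "__main__":
    # Synthetic stand-in data (hypothetical): one latent coordinate per user
    # and a noisy behavior score that partly depends on it.
    rng = random.Random(0)
    latent = [rng.gauss(0, 1) for _ in range(500)]
    behavior = [z + rng.gauss(0, 1) for z in latent]
    u, p, r = compare_extremes(latent, behavior)
    print(f"U={u:.1f}, p={p:.3g}, rank-biserial r={r:.2f}")
```

Reporting the effect size alongside the p-value matters here because, with enough users in each tail, even a negligible shift becomes statistically significant. When many latent dimensions are tested this way, a multiple-comparison correction would also be needed.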

Paper Structure

This paper contains 21 sections, 5 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Block and plate diagram of the probabilistic digital twin model. For each user $u$, a latent digital twin state $z_u \sim p(z)$ is transformed by a neural decoder $f_\theta$ to parameterize the observation distribution $p_\theta(x_u \mid z_u)$, generating observed behavior $x_u$. Inference is performed via an encoder $q_\phi(z_u \mid x_u)$ (dashed arrows).
  • Figure 2: Training and validation loss curves for the standard and hierarchical VAE models. The hierarchical VAE converges more rapidly and achieves lower loss on both training and validation data, indicating improved model fit.
  • Figure 3: Low-dimensional visualization of the learned latent space using PCA. Each point corresponds to a user and is colored by the dominant latent dimension. The latent structure appears predominantly continuous, with variation organized along a small number of axes.
  • Figure 4: Cluster-wise comparison of response extremity versus neutrality. Clusters primarily reflect variation along the dominant latent dimension rather than sharply distinct user groups, consistent with a continuous latent structure.
  • Figure 5: Latent dimension importance analysis. Shown are variance- and range-based rankings across latent dimensions, along with distributional summaries for a subset of high-importance dimensions. Dimension 33 consistently emerges as the most salient axis under multiple quantitative criteria, motivating its selection for detailed interpretation.
  • ...and 8 more figures
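
For reference, the model in the Figure 1 caption matches the usual VAE setup, whose training objective is the evidence lower bound (ELBO). The form below is the textbook single-layer objective written in the caption's notation; it is assumed here rather than quoted from the paper, and the hierarchical variant compared in Figure 2 would presumably include additional terms.

$$
\log p_\theta(x_u) \;\ge\; \mathbb{E}_{q_\phi(z_u \mid x_u)}\!\left[\log p_\theta(x_u \mid z_u)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z_u \mid x_u)\,\|\,p(z)\right)
$$

The first term rewards accurate reconstruction of the observed behavior $x_u$ through the decoder $f_\theta$. The second keeps each user's approximate posterior close to the prior $p(z)$, and the spread of $q_\phi(z_u \mid x_u)$ is what supplies per-user uncertainty.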