GUIDE-VAE: Advancing Data Generation with User Information and Pattern Dictionaries

Kutay Bölat, Simon Tindemans

Abstract

Generative modelling of multi-user datasets has become prominent in science and engineering. Generating a data point for a given user requires user information, yet conventional generative models, including variational autoencoders (VAEs), often ignore it. This paper introduces GUIDE-VAE, a novel conditional generative model that leverages user embeddings to generate user-guided data. By allowing the model to benefit from shared patterns across users, GUIDE-VAE enhances performance in multi-user settings, even under significant data imbalance. In addition to integrating user information, GUIDE-VAE incorporates a pattern dictionary-based covariance composition (PDCC) to improve the realism of generated samples by capturing complex feature dependencies. While user embeddings drive performance gains, PDCC addresses common issues such as noise and over-smoothing typically seen in VAEs. The proposed GUIDE-VAE was evaluated on a multi-user smart meter dataset characterized by substantial data imbalance across users. Quantitative results show that GUIDE-VAE performs effectively in both synthetic data generation and missing record imputation tasks, while qualitative evaluations reveal that GUIDE-VAE produces more plausible and less noisy data. These results establish GUIDE-VAE as a promising tool for controlled, realistic data generation in multi-user datasets, with potential applications across various domains requiring user-informed modelling.

Paper Structure

This paper contains 34 sections, 1 theorem, 14 equations, 16 figures.

Key Result

Theorem 1

Any positive definite matrix can be constructed using PDCC in at least $V(V-T)+1$ different ways.

Figures (16)

  • Figure 1: Conventional generative models disregard user information during training, treating the dataset anonymously, which limits their ability to generate data for specific users during inference. GUIDE-VAE addresses this by incorporating user information in the training process, enabling control over the generated outputs for individual users.
  • Figure 2: The user embedding framework for multi-user time-series datasets. Users' time-series data are segmented into profiles and clustered using $k$-means. These clusters are treated as words, and user datasets are treated as documents for LDA. After training, LDA produces a user dictionary $\Gamma$, where each element corresponds to the parameters of a Dirichlet distribution, serving as user embeddings.
  • Figure 3: Overall computational diagram of GUIDE-VAE. GUIDE-VAE is a CVAE-based model enhanced with a learnable pattern dictionary for PDCC, which captures feature dependencies for improved realism. The user of the data point is selected from the user dictionary, and a sample from the corresponding probabilistic embedding is concatenated with auxiliary conditions (e.g., metadata or timestamps) and applied to both the encoder and decoder.
  • Figure 4: The pmf of the beta-binomial distribution ($n$=365, $a$=0.85) on a logarithmic scale for different $b$ values.
  • Figure 5: The data splitting scheme used in experimentation. A full dataset, in which each user has an equal number of profiles, is first amputated according to a beta-binomial distribution, and the removed profiles are reserved in the missing set (red). The remaining profiles are randomly split into training (green), validation (blue) and testing (yellow) sets.
  • ...and 11 more figures
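
The beta-binomial distribution named in the captions of Figures 4 and 5 can be evaluated with the standard library alone. The following is a minimal sketch, not the authors' code: $n$=365 and $a$=0.85 come from the Figure 4 caption, while the $b$ values, the function names, and the idea of drawing one count per user are assumptions made purely for illustration.

```python
import math
import random

N_DAYS = 365                   # n from the Figure 4 caption
A = 0.85                       # a from the Figure 4 caption
B_VALUES = (1.0, 5.0, 20.0)    # hypothetical b values, not taken from the paper


def betabinom_logpmf(k, n, a, b):
    """Log-pmf of the beta-binomial distribution at k in {0, ..., n}.

    pmf(k) = C(n, k) * B(k + a, n - k + b) / B(a, b), computed via log-gamma
    so that small probabilities stay representable on a logarithmic scale.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def sample_betabinom(n, a, b, rng):
    """Draw one beta-binomial count: p ~ Beta(a, b), then k ~ Binomial(n, p)."""
    p = rng.betavariate(a, b)
    return sum(rng.random() < p for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    for b in B_VALUES:
        # Log-probabilities at a few counts, as one curve of a log-scale plot.
        logs = [betabinom_logpmf(k, N_DAYS, A, b) for k in (0, 100, 365)]
        draw = sample_betabinom(N_DAYS, A, b, rng)
        print(f"b={b}: log-pmf at k=0,100,365 = {logs}; one sampled count = {draw}")
```

The mean of a beta-binomial distribution is $n\,a/(a+b)$, so for fixed $n$ and $a$ a larger $b$ moves probability mass towards smaller counts.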

Theorems & Definitions (3)

  • Definition 1: Zero-preserved log-normalization
  • Theorem 1
  • proof