Virtual Human Generative Model: Masked Modeling Approach for Learning Human Characteristics

Kenta Oono, Nontawat Charoenphakdee, Kotatsu Bito, Zhengyan Gao, Hideyoshi Igata, Masashi Yoshikawa, Yoshiaki Ota, Hiroki Okui, Kei Akita, Shoichiro Yamaguchi, Yohei Sugawara, Shin-ichi Maeda, Kunihiko Miyoshi, Yuki Saito, Koki Tsuda, Hiroshi Maruyama, Kohei Hayashi

TL;DR

VHGM-MAE addresses the challenge of modeling the joint distribution over more than $2000$ healthcare attributes from heterogeneous, highly incomplete data. It combines a transformer-based masked autoencoder with a likelihood-based per-attribute distribution to perform missing-value imputation and probabilistic sampling, capturing uncertainty in real, count, ordinal, categorical, and other attributes. The method features a single shared encoder, a shared decoder, and lightweight per-attribute decoders, augmented by mask augmentation and a two-stage training schedule to handle MNAR and cross-dataset variation. Empirical results on benchmark and real-world large-$p$ datasets show VHGM-MAE achieves competitive imputation accuracy and that synthetic data generated by the model can improve downstream predictive tasks when combined with real data. This work enables scalable, uncertainty-aware data completion and synthetic data generation in healthcare applications, while noting that it does not address causality or time-series modeling.
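The mask augmentation mentioned above can be pictured with a small sketch. The text does not spell out the scheme, so the following is only a generic illustration under stated assumptions: some entries that are actually observed are hidden from the model's input at random, and the model is then asked to reconstruct them. The function name `augment_mask` and the parameter `hide_prob` are hypothetical.

```python
import random


def augment_mask(observed, hide_prob, rng):
    """Randomly hide some observed attributes (illustrative only).

    observed  : list of bools, True where the attribute has a value.
    hide_prob : probability of hiding each observed attribute (assumed knob).
    rng       : a random.Random instance, for reproducibility.

    Returns (input_mask, target_mask):
      input_mask  -- attributes the model is allowed to see,
      target_mask -- observed attributes that were hidden and can be
                     used as reconstruction targets.
    """
    input_mask = [o and rng.random() >= hide_prob for o in observed]
    target_mask = [o and not i for o, i in zip(observed, input_mask)]
    return input_mask, target_mask


# Example: attributes 2 and 4 are truly missing in the data.
observed = [True, False, True, False, True, True]
input_mask, target_mask = augment_mask(observed, hide_prob=0.5, rng=random.Random(0))
```

Truly missing attributes never become targets because no ground truth exists for them; only artificially hidden ones do. Whether the loss covers the hidden entries only or all observed entries is a design choice that this sketch leaves open.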

Abstract

Virtual Human Generative Model (VHGM) is a generative model that approximates the joint probability over more than 2000 human healthcare-related attributes. This paper presents the core algorithm, VHGM-MAE, a masked autoencoder (MAE) tailored for handling high-dimensional, sparse healthcare data. VHGM-MAE tackles four key technical challenges: (1) heterogeneity of healthcare data types, (2) probability distribution modeling, (3) systematic missingness in the training dataset arising from multiple data sources, and (4) the high-dimensional, small-$n$-large-$p$ problem. To address these challenges, VHGM-MAE employs a likelihood-based approach to model distributions with heterogeneous types, a transformer-based MAE to capture complex dependencies among observed and missing attributes, and a novel training scheme that effectively leverages available samples with diverse missingness patterns to mitigate the small-$n$-large-$p$ problem. Experimental results demonstrate that VHGM-MAE outperforms existing methods in both missing value imputation and synthetic data generation.
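To make the likelihood-based treatment of heterogeneous types concrete, here is a minimal sketch of a per-record negative log-likelihood that sums over observed attributes only. The distribution choices (Gaussian for real values, Poisson for counts, softmax-categorical for categories), the attribute names, and the function names are all assumptions for illustration; the abstract does not specify which distributions the model uses.

```python
import math


def gaussian_nll(x, mean, log_var):
    """Negative log-density of N(mean, exp(log_var)) at x."""
    return 0.5 * (math.log(2 * math.pi) + log_var + (x - mean) ** 2 / math.exp(log_var))


def poisson_nll(k, log_rate):
    """Negative log-probability of count k under Poisson(exp(log_rate))."""
    return math.exp(log_rate) - k * log_rate + math.lgamma(k + 1)


def categorical_nll(label, logits):
    """Negative log-probability of class `label` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


# One loss function per attribute type (hypothetical type names).
NLL = {"real": gaussian_nll, "count": poisson_nll, "categorical": categorical_nll}


def record_nll(record, schema, params):
    """Sum per-attribute NLLs, skipping attributes whose value is missing (None)."""
    total = 0.0
    for name, kind in schema.items():
        value = record.get(name)
        if value is None:
            continue  # missing attributes contribute nothing to the loss
        total += NLL[kind](value, *params[name])
    return total


# Hypothetical schema, record, and predicted distribution parameters.
schema = {"height_z": "real", "visits": "count", "blood_type": "categorical"}
record = {"height_z": 0.0, "visits": 0, "blood_type": None}
params = {"height_z": (0.0, 0.0), "visits": (0.0,), "blood_type": ([0.0] * 4,)}
loss = record_nll(record, schema, params)
```

Because every attribute type contributes a log-likelihood term on the same scale, real, count, and categorical columns can be trained under one objective, and records with different missingness patterns can share the same model.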

Paper Structure (21 sections, 2 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Overview of VHGM's core use-case. Users provide healthcare data, which is used to query VHGM for healthcare attribute inference; personalized services are returned in response.
  • Figure 2: Overview of VHGM-MAE where $p=4$, $x_2$ and $x_4$ are missing inputs. Transformer is used for the encoder and common decoder, while linear models are used for the attribute-specific (Attr-spec) decoders to scale well when $p$ is large.
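The data flow in Figure 2 can be traced with a toy sketch for $p=4$ with $x_2$ and $x_4$ missing. This is not the authors' implementation: a single identity-projection self-attention layer stands in for each transformer, all weights are random, and the assumption that the encoder sees only observed attributes while the common decoder sees all $p$ positions follows the usual MAE convention rather than anything stated in the caption. Every name here (`attention`, `forward`, `mask_token`, `heads`) is hypothetical.

```python
import math
import random


def attention(tokens):
    """Single-head self-attention with identity projections (transformer stand-in)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def forward(x, d=8, seed=0):
    """Map a partially observed record x (None = missing) to one output per attribute."""
    rng = random.Random(seed)
    p = len(x)
    # Randomly initialised, untrained parameters (illustrative only).
    attr_emb = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(p)]  # attribute identity
    val_proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(p)]  # value embedding
    mask_token = [rng.gauss(0, 1) for _ in range(d)]
    heads = [([rng.gauss(0, 1) for _ in range(d)], 0.0) for _ in range(p)]  # linear decoders

    observed = [j for j, v in enumerate(x) if v is not None]

    # Encoder: attends over observed attributes only (assumed MAE convention).
    enc_in = [[x[j] * val_proj[j][k] + attr_emb[j][k] for k in range(d)] for j in observed]
    latent = dict(zip(observed, attention(enc_in)))

    # Common decoder: all p positions, with a mask token where the input is missing.
    dec_in = [
        latent[j] if j in latent else [mask_token[k] + attr_emb[j][k] for k in range(d)]
        for j in range(p)
    ]
    dec_out = attention(dec_in)

    # Attribute-specific decoders: one linear map per attribute, so cost grows linearly in p.
    return [sum(w * h for w, h in zip(heads[j][0], dec_out[j])) + heads[j][1] for j in range(p)]


x = [0.5, None, -1.2, None]  # x_2 and x_4 missing, as in Figure 2
outputs = forward(x)
```

In a trained model, each attribute-specific output would parameterise that attribute's distribution rather than being a single scalar. Keeping these heads linear leaves the attention layers as the only components shared across all attributes.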