TimeAutoDiff: A Unified Framework for Generation, Imputation, Forecasting, and Time-Varying Metadata Conditioning of Heterogeneous Time Series Tabular Data

Namjoon Suh; Yuning Yang; Din-Yin Hsieh; Qitong Luan; Shirong Xu; Shixiang Zhu; Guang Cheng

TL;DR

TimeAutoDiff is a unified latent-diffusion framework that handles heterogeneous time-series data across four tasks: generation, imputation, forecasting, and time-varying metadata-conditional generation (TV-MCG). A VAE maps mixed-type features into a continuous latent space, and diffusion is applied in that latent space. A task-agnostic masking scheme unifies the objectives, and efficiency comes from latent-space diffusion and feature-axis compression. Empirical results show strong fidelity, improved imputation and forecasting performance, and realistic metadata-conditioned trajectory generation, supported by ablations and robustness analyses. The work points to a practical, scalable path for multi-task time-series synthesis and scenario exploration, with avenues for privacy, interpretability, and foundation-model extensions.

Abstract

We present TimeAutoDiff, a unified latent-diffusion framework for four fundamental time-series tasks: unconditional generation, missing-data imputation, forecasting, and time-varying-metadata conditional generation. The model natively supports heterogeneous features including continuous, binary, and categorical variables. We unify all tasks using a masked-modeling strategy in which a binary mask specifies which time-series cells are observed and which must be generated. TimeAutoDiff combines a lightweight variational autoencoder, which maps mixed-type features into a continuous latent sequence, with a diffusion model that learns temporal dynamics in this latent space. Two architectural choices provide strong speed and scalability benefits. The diffusion model samples an entire latent trajectory at once rather than denoising one timestep at a time, greatly reducing reverse-diffusion calls. In addition, the VAE compresses along the feature axis, enabling efficient modeling of wide tables in a low-dimensional latent space. Empirical evaluation shows that TimeAutoDiff matches or surpasses strong baselines in synthetic sequence fidelity and consistently improves imputation and forecasting performance. Metadata conditioning enables realistic scenario exploration, allowing users to edit metadata sequences and produce coherent counterfactual trajectories that preserve cross-feature dependencies. Ablation studies highlight the importance of the VAE's feature encoding and key components of the denoiser. A distance-to-closest-record audit further indicates that the model generalizes without excessive memorization. Code is available at https://github.com/namjoonsuh/TimeAutoDiff

Paper Structure (40 sections, 43 equations, 19 figures, 13 tables, 2 algorithms)

Introduction
Relevant Literatures
Problem Setting
Method
Objective function
Pre- and post-processing steps in TimeAutoDiff
Variational Autoencoder in TimeAutoDiff
Diffusion Model in TimeAutoDiff
Training, Inference & Computational Efficiency of TimeAutoDiff
Experiments
Unconditional Generation
Experimental Setting
Unconditional Generation
Imputation & Forecasting
Time-Varying Metadata Conditional Generation (TV-MCG)
...and 25 more sections

Figures (19)

Figure 1: Overview of TimeAutoDiff (Unconditional Generation). The model has three components: (1) pre- and post-processing steps for the original (i.e., $\mathbf{X}^{\text{Orig}}$) and synthesized data (i.e., $\tilde{\mathbf{X}}^{\text{Post}}$); (2) a VAE for training the encoder and decoder, and for projecting the pre-processed data to the latent space; (3) a diffusion model for learning the distribution of the projected data in latent space and generating new latent data. Notably, the feature dimension can be compressed in the latent space such that $L \leq F$.
Figure 2: Illustration of the four binary masks $\mathbf{M}$ (shaded cells $=1$) on a $T\times F$ time-series grid: (A) Unconditional Generation ($\mathbf{M}^U$): all $T\times F$ entries are shaded. (B) Missing-Data Imputation ($\mathbf{M}^I_{t,f}=\mathbbm{1}(X_{t,f}\text{ is missing})$): shaded cells mark missing entries, and unshaded cells indicate observed values. (C) Forecasting ($\mathbf{M}^F_{t,f}=\mathbbm{1}(t>w)$): only rows with $t>w$ are shaded; the first $w$ rows serve as conditioning. (D) Metadata-Conditional ($\mathbf{M}^M_{t,f}=\mathbbm{1}(f\in\mathcal{F})$): only columns corresponding to features in $\mathcal{F}$ are shaded.
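The four masks in Figure 2 can be sketched in a few lines of plain Python. This is only an illustration of the indicator definitions in the caption, not the authors' implementation: the function names, the use of `None` to mark a missing cell, and the 0-indexed row and column convention are all assumptions made for this sketch.

```python
from typing import List, Optional, Set

Mask = List[List[int]]  # T rows by F columns; 1 = shaded cell, 0 = unshaded cell


def unconditional_mask(T: int, F: int) -> Mask:
    # (A) M^U: every cell is shaded.
    return [[1] * F for _ in range(T)]


def imputation_mask(X: List[List[Optional[float]]]) -> Mask:
    # (B) M^I[t][f] = 1 if X[t][f] is missing (None is an assumed missing marker).
    return [[1 if value is None else 0 for value in row] for row in X]


def forecasting_mask(T: int, F: int, w: int) -> Mask:
    # (C) M^F[t][f] = 1 if t > w, with t counted from 1 as in the caption.
    # Rows are 0-indexed here, so row index r corresponds to t = r + 1.
    return [[1 if (r + 1) > w else 0 for _ in range(F)] for r in range(T)]


def metadata_mask(T: int, F: int, feature_set: Set[int]) -> Mask:
    # (D) M^M[t][f] = 1 if column f belongs to the feature set (0-indexed columns).
    return [[1 if f in feature_set else 0 for f in range(F)] for _ in range(T)]
```

For example, `forecasting_mask(3, 2, 1)` shades the last two of three rows, leaving the first row as the conditioning window.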
Figure 3: Schematic architecture of the encoder in the variational autoencoder (VAE). The encoder takes pre-processed multivariate time series input $\mathbf{X} = [\mathbf{x}^{\text{Proc}}_{\text{Disc}}; \mathbf{x}^{\text{Proc}}_{\text{Cont}}] \in \mathbb{R}^{T \times F}$ composed of $m$ discrete and $c$ continuous features, where $F = m + c$. Discrete features are embedded via a lookup table $\mathbf{e}(\cdot) \in \mathbb{R}^{d}$, while continuous features are transformed using frequency-based representations (FR) to capture spectral information. At each time step $t$, the embeddings are concatenated into $\mathbf{E}(t) \in \mathbb{R}^{(m+c)d}$ and passed through an MLP to yield feature embeddings $\mathbf{f}_t \in \mathbb{R}^{F}$. The full sequence $\{\mathbf{f}_t\}_{t=1}^{T}$ is then processed independently by two RNNs to model temporal dependencies: one RNN estimates the mean vector $\boldsymbol{\mu} \in \mathbb{R}^{T \times L}$, and the other the log-variance $\log \boldsymbol{\sigma}^2 \in \mathbb{R}^{T \times L}$ of the approximate posterior. The latent trajectory $\mathbf{Z}_0^{\text{Lat}} \in \mathbb{R}^{T \times L}$ is obtained by sampling via the reparameterization trick: $\mathbf{Z}_0^{\text{Lat}} = \boldsymbol{\mu} + \mathbf{E} \odot \boldsymbol{\Sigma}$, where $\boldsymbol{\Sigma} = \exp(0.5 \log \boldsymbol{\sigma}^2)$ and $\mathbf{E} \sim \mathcal{N}(0, \mathbf{I})$. This encoder compresses the feature dimension from $F$ to $L$ while preserving temporal resolution.
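The reparameterization step at the end of the Figure 3 caption, $\mathbf{Z}_0^{\text{Lat}} = \boldsymbol{\mu} + \mathbf{E} \odot \exp(0.5 \log \boldsymbol{\sigma}^2)$, can be written out as a minimal standard-library sketch. The function name `reparameterize`, the nested-list matrix type, and the explicit `rng` argument are assumptions for illustration; the two RNNs that would produce `mu` and `log_var` are not modeled here.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # T rows (time steps) by L columns (latent features)


def reparameterize(mu: Matrix, log_var: Matrix, rng: random.Random) -> Matrix:
    """Sample Z = mu + E * exp(0.5 * log_var) elementwise, with E ~ N(0, 1)."""
    return [
        [
            m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
            for m, lv in zip(mu_row, lv_row)
        ]
        for mu_row, lv_row in zip(mu, log_var)
    ]
```

The output has the same $T \times L$ shape as `mu`, which matches the caption's point that the encoder compresses only the feature axis and keeps the temporal resolution. Passing a seeded `random.Random` makes the draw reproducible.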
Figure 4: VAE decoder with conditioning and modality-specific masking. A latent trajectory $\mathbf{Z}^{\text{Lat}}_{0}$ is mapped by an MLP to a hidden sequence $\mathbf{H}_{dec}$. Conditioning information $\mathbf{X}^{\text{con}}$ is embedded (Emb) and projected with a Linear layer, then added to the hidden state to form $\tilde{\mathbf{H}}_{dec}=\mathbf{H}_{dec}+\mathrm{Linear}(\mathrm{Emb}(\mathbf{X}^{\text{con}}))$, injecting context at every time step. From $\tilde{\mathbf{H}}_{dec}$, three modality-specific heads produce targets: Binary — $\mathrm{Linear}(\tilde{\mathbf{H}}_{dec})+\mathbf{M}_{\text{bin}}\rightarrow \mathbf{x}^{\text{tar}}_{\text{Bin}}$ (logits with an additive mask that biases masked positions); Categorical — for each categorical variable $i$, $\mathrm{Linear}[i](\tilde{\mathbf{H}}_{dec})+\mathbf{M}^{(i)}_{\text{cat}}\rightarrow \mathbf{x}^{\text{tar},i}_{\text{Cat}}$ (class logits with an additive mask that routes masked entries to a designated class, e.g., index $0$); Numerical — $\sigma(\mathrm{Linear}(\tilde{\mathbf{H}}_{dec}))\odot\mathbf{M}_{\text{num}}\rightarrow \mathbf{x}^{\text{tar}}_{\text{Num}}$ (values with a multiplicative elementwise gate). Type-specific masks $\{\mathbf{M}_{\text{bin}},\,\mathbf{M}^{(i)}_{\text{cat}},\,\mathbf{M}_{\text{num}}\}$ are derived from the task mask $\mathbf{M}^{\text{task}}$ by splitting and reshaping according to feature types, so that binary/categorical channels use additive logit biasing while numerical channels use elementwise gating.
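The two masking styles in the Figure 4 caption, additive logit biasing for categorical channels and multiplicative gating for numerical channels, can be illustrated per cell as follows. This is a hedged sketch rather than the paper's code: the `keep` flag (1 = the head's output is used, 0 = the cell is masked), the bias constant `LARGE_NEGATIVE`, and both function names are assumptions, and the caption does not specify how the task mask maps onto this flag.

```python
import math
from typing import List

LARGE_NEGATIVE = -1e9  # assumed bias magnitude; the caption gives no value


def categorical_head(logits: List[float], keep: int, masked_class: int = 0) -> List[float]:
    # Additive mask: zero bias when the cell is kept; otherwise every class
    # except `masked_class` is pushed down, so the masked entry is routed to
    # the designated class (index 0 in the caption's example).
    bias = [
        0.0 if (keep or k == masked_class) else LARGE_NEGATIVE
        for k in range(len(logits))
    ]
    return [logit + b for logit, b in zip(logits, bias)]


def numerical_head(pre_activation: float, keep: int) -> float:
    # Multiplicative gate: sigmoid output times the 0/1 mask entry.
    return (1.0 / (1.0 + math.exp(-pre_activation))) * keep
```

With `keep=0`, the arg-max of `categorical_head` is always `masked_class`, while `numerical_head` returns exactly zero; with `keep=1`, both heads pass their values through unchanged.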
Figure 5: The schematic architecture of the denoising model $\epsilon_{\theta}(\mathbf{Z}_{n}^{\text{Lat}}, n, \mathbf{t}, \textbf{ts})$ in the diffusion framework. The inputs to the model $\epsilon_{\theta}$ include the noisy latent matrix $\mathbf{Z}_{n}^{\text{Lat}}$ at the $n$th diffusion step, the diffusion step index $n$, the normalized time points $\mathbf{t}$, and the periodic timestamp embeddings $\textbf{ts}$, projected through a multilayer perceptron (MLP). When conditional data $\mathbf{X}^{\text{con}} = [c_{\text{disc}}, c_{\text{cont}}]$ is available, it is embedded using a word embedding (WE) for discrete variables and frequency-based representations (FR) for continuous variables, producing $\mathbf{Z}^{\text{con}} := \textbf{Emb}(\mathbf{X}^{\text{con}})$. These embeddings are processed through a Bi-directional RNN (Bi-RNN) to capture temporal correlations, and the output is linearly projected and fused with $\mathbf{Z}_{n}^{\text{Lat}}$ to produce $\mathbf{Z}_n'$. All components—$\mathbf{Z}_n'$, $n$, $\mathbf{t}$, and $\textbf{ts}$—are passed through positional encodings and MLPs before concatenation. Here, $\bigoplus$ denotes matrix summation. The final block of RNNs, followed by layer normalization or fully connected layers (LN/FC), produces the diffusion stepwise prediction of noise $\epsilon_{\theta}(\mathbf{Z}_{n}^{\text{Lat}}, n, \mathbf{t}, \textbf{ts})$.
...and 14 more figures
