Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification

Zeqi Ye, Minshuo Chen

TL;DR

The paper develops a theory for diffusion-transformer-based imputation of time series with missing values, focusing on learning the conditional distribution $P(\mathbf{x}_{\text{miss}}|\mathbf{x}_{\text{obs}})$ for Gaussian process (GP) data. It establishes statistical efficiency results using a novel score-approximation framework based on algorithm unrolling, and provides uncertainty quantification through confidence regions whose explicit coverage guarantees depend on distribution shift and on the conditioning of the missing pattern. A key contribution is a mixed-masking training strategy that exposes the model to diverse missing patterns to improve robustness and reduce distribution shift. The theoretical bounds are complemented by experiments on Gaussian and latent GP data that show improved imputation quality and reliable uncertainty quantification, and that guide the practical design of diffusion-based imputers for time-series data.
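For Gaussian data the target distribution $P(\mathbf{x}_{\text{miss}}|\mathbf{x}_{\text{obs}})$ has a closed form, which gives a useful reference point for what an imputer is trying to learn. The sketch below computes the conditional mean and covariance with NumPy. The squared-exponential kernel, the sequence length, the mask, and all function names are illustrative assumptions and are not taken from the paper; the block names `S_obs`, `S_cor`, `S_miss` only mirror the covariance components mentioned in the figure captions.

```python
import numpy as np

def rbf_kernel(n, length_scale=3.0, jitter=1e-4):
    """Squared-exponential covariance on time indices 0..n-1 (illustrative choice)."""
    t = np.arange(n, dtype=float)
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2) + jitter * np.eye(n)

def gaussian_conditional(x_obs, mask_obs, mean, cov):
    """Mean and covariance of x_miss | x_obs for a jointly Gaussian vector.

    mask_obs is a boolean array that is True where an entry is observed.
    """
    i_obs = np.flatnonzero(mask_obs)
    i_miss = np.flatnonzero(~mask_obs)
    S_obs = cov[np.ix_(i_obs, i_obs)]     # observed-observed block
    S_cor = cov[np.ix_(i_obs, i_miss)]    # observed-missing cross block
    S_miss = cov[np.ix_(i_miss, i_miss)]  # missing-missing block
    # Solve a linear system rather than forming the inverse explicitly.
    sol = np.linalg.solve(S_obs, S_cor)   # S_obs^{-1} S_cor
    cond_mean = mean[i_miss] + sol.T @ (x_obs - mean[i_obs])
    cond_cov = S_miss - S_cor.T @ sol     # Schur complement
    return cond_mean, cond_cov

# Hypothetical example: 12 time steps, entries 4..6 missing.
rng = np.random.default_rng(0)
n = 12
cov = rbf_kernel(n)
mean = np.zeros(n)
x = rng.multivariate_normal(mean, cov)
mask_obs = np.ones(n, dtype=bool)
mask_obs[4:7] = False
cond_mean, cond_cov = gaussian_conditional(x[mask_obs], mask_obs, mean, cov)
```

The conditional covariance does not depend on the observed values, only on which entries are observed, which is one way to see why the missing pattern itself can affect how hard imputation is.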

Abstract

Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited. Our work addresses this gap by investigating the statistical efficiency of conditional diffusion transformers for imputation and quantifying the uncertainty in missing values. Specifically, we derive statistical sample complexity bounds based on a novel approximation theory for conditional score functions using transformers, and, through this, construct tight confidence regions for missing values. Our findings also reveal that the efficiency and accuracy of imputation are significantly influenced by the missing patterns. Furthermore, we validate these theoretical insights through simulation and propose a mixed-masking training strategy to enhance the imputation performance.
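The mixed-masking idea, training on masks drawn from several missing patterns rather than a single one, can be sketched as a simple mask sampler. The three pattern generators below (scattered, contiguous block, periodic) are hypothetical stand-ins chosen for illustration; the patterns and mixing weights actually used in the paper are not specified here.

```python
import numpy as np

def random_point_mask(n, n_miss, rng):
    """Missing entries scattered uniformly at random."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_miss, replace=False)] = True
    return mask

def block_mask(n, n_miss, rng):
    """One contiguous missing block at a random offset."""
    start = rng.integers(0, n - n_miss + 1)
    mask = np.zeros(n, dtype=bool)
    mask[start:start + n_miss] = True
    return mask

def periodic_mask(n, n_miss, rng):
    """Evenly spaced missing entries with a random phase."""
    stride = n // n_miss
    phase = rng.integers(0, stride)
    mask = np.zeros(n, dtype=bool)
    mask[phase + stride * np.arange(n_miss)] = True
    return mask

# Hypothetical pattern family; True marks a missing position.
PATTERNS = (random_point_mask, block_mask, periodic_mask)

def sample_mixed_masks(batch_size, n, n_miss, rng, weights=None):
    """Pick one pattern per training example, then draw a mask from it."""
    idx = rng.choice(len(PATTERNS), size=batch_size, p=weights)
    return np.stack([PATTERNS[i](n, n_miss, rng) for i in idx])

rng = np.random.default_rng(0)
masks = sample_mixed_masks(batch_size=8, n=96, n_miss=24, rng=rng)
```

In a training loop, each row of `masks` would decide which entries of a sequence are hidden from the model and used as imputation targets for that example.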

Paper Structure

This paper contains 64 sections, 20 theorems, 201 equations, 4 figures, 8 tables, 2 algorithms.

Key Result

Lemma 1

Suppose Assumption \ref{assump:data_assumption} holds. For an arbitrarily fixed time $t \in (0,T]$ and an error tolerance $\epsilon \in (0,1)$, choose $K, K_{\rm aux}$ as [...]. Then, given $\delta > 0$, for any $\mathbf{x}_{\rm obs}$ and $\mathbf{v}_t$ in a compact region $\mathcal{C}_\delta$, there exist step sizes $\eta_t$ and $\theta$ such that running Algorithm \ref{alg:double_gd} gives rise to [...]
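The lemma bounds the error of a gradient-descent procedure run for a fixed number of steps, which is the core of the algorithm-unrolling argument: each step needs only matrix-vector products, so a finite stack of layers can emulate it. As a minimal sketch of that idea, and not the paper's nested algorithm, the code below uses a single gradient-descent loop to approximate $\mathbf{A}^{-1}\mathbf{b}$ for a symmetric positive definite $\mathbf{A}$. The matrix, the step-size rule, and the names `gd_solve` and `num_steps` are assumptions made for illustration.

```python
import numpy as np

def gd_solve(A, b, num_steps, step_size=None):
    """Approximate A^{-1} b by gradient descent on f(u) = 0.5 u^T A u - u^T b.

    A must be symmetric positive definite. Each iteration uses one
    matrix-vector product, mirroring one unrolled layer.
    """
    if step_size is None:
        eigs = np.linalg.eigvalsh(A)
        step_size = 2.0 / (eigs[0] + eigs[-1])  # classical fixed step for quadratics
    u = np.zeros_like(b)
    for _ in range(num_steps):
        u = u - step_size * (A @ u - b)         # gradient of f is A u - b
    return u

# Hypothetical test matrix with eigenvalues in [1, 10] (condition number 10).
rng = np.random.default_rng(0)
n = 20
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1.0, 10.0, n)) @ Q.T
A = 0.5 * (A + A.T)                             # remove rounding asymmetry
b = rng.standard_normal(n)
exact = np.linalg.solve(A, b)

# Relative error after different numbers of unrolled steps.
errors = [
    np.linalg.norm(gd_solve(A, b, k) - exact) / np.linalg.norm(exact)
    for k in (5, 20, 80)
]
```

With this step size the error contracts by a factor of at most $(\kappa-1)/(\kappa+1)$ per step, where $\kappa$ is the condition number of $\mathbf{A}$, so the number of steps needed for a given tolerance grows with $\kappa$.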

Figures (4)

  • Figure 1: Constructed transformer architecture: Within each transformer block, attention heads focus on capturing information of different covariance components ($\bm{\Sigma}_{\rm obs}$, $\bm{\Sigma}_{\rm cor}$, $\bm{\Sigma}_{\rm miss}$) separately, and approximate corresponding matrix–vector multiplications. A total of $K$ block groups perform major GD steps, with $K_{\rm aux}$ inner blocks in each group dedicated to solving the auxiliary problem.
  • Figure 2: Visualization of the four missing patterns for a sequence of length 96. Each horizontal line shows the positions of missing values (highlighted in blue, orange, green and red for Patterns 1-4), and annotations on the right indicate the pattern number and its condition number $\kappa(\bm{\Sigma}_{\rm cond})$.
  • Figure 3: Percentage of real data samples that fall within the 95% confidence region (CR) generated by the diffusion transformer (DiT).
  • Figure 4: Comparison of imputation methods on the Electricity dataset, with 95% CR.
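The condition number $\kappa(\bm{\Sigma}_{\rm cond})$ reported per missing pattern in Figure 2 can be explored numerically. The sketch below assumes that $\bm{\Sigma}_{\rm cond}$ denotes the Gaussian conditional covariance of the missing entries given the observed ones; that reading, the kernel, and the two masks are assumptions for illustration and do not reproduce the paper's four patterns.

```python
import numpy as np

def conditional_cov(cov, miss):
    """Covariance of the missing entries given the observed ones (Schur complement).

    miss is a boolean array that is True where an entry is missing.
    """
    i_obs = np.flatnonzero(~miss)
    i_miss = np.flatnonzero(miss)
    S_obs = cov[np.ix_(i_obs, i_obs)]
    S_cor = cov[np.ix_(i_obs, i_miss)]
    S_miss = cov[np.ix_(i_miss, i_miss)]
    return S_miss - S_cor.T @ np.linalg.solve(S_obs, S_cor)

# Hypothetical covariance for a sequence of length 96.
n = 96
t = np.arange(n, dtype=float)
cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 4.0) ** 2) + 1e-3 * np.eye(n)

# Two illustrative patterns with the same number of missing entries (24).
scattered = np.zeros(n, dtype=bool)
scattered[::4] = True
block = np.zeros(n, dtype=bool)
block[36:60] = True

# 2-norm condition number of the conditional covariance for each pattern.
kappas = {
    name: float(np.linalg.cond(conditional_cov(cov, miss)))
    for name, miss in (("scattered", scattered), ("block", block))
}
```

Comparing the entries of `kappas` shows how two masks with the same missing rate can still lead to differently conditioned problems.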

Theorems & Definitions (34)

  • Lemma 1: Representation error of Algorithm \ref{alg:double_gd}
  • Theorem 1
  • Theorem 2
  • Definition 1
  • Corollary 1
  • Remark 1
  • Lemma 2
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 24 more