Deep Generative Sampling in the Dual Divergence Space: A Data-efficient & Interpretative Approach for Generative AI

Sahil Garg, Anderson Schneider, Anant Raj, Kashif Rasul, Yuriy Nevmyvaka, Sneihil Gopal, Amit Dhurandhar, Guillermo Cecchi, Irina Rish

TL;DR

The paper addresses the challenge of generating high-dimensional time-series data treated as images under small-sample regimes, where traditional decoders or diffusion models risk overfitting. It introduces a novel approach that estimates the KL-divergence in its dual form between the data distribution and the product-of-marginals base, enabling direct sampling in a 1-D dual space. A path-based and localized divergence estimation framework is developed to embody dependencies and enable multi-scale clustering, with gradient-walk sampling in the resulting dual-space gaps. The method is backed by theoretical variance/complexity considerations and validated empirically across eight diverse domains, often outperforming standard baselines on multiple information-theoretic and diversity metrics, demonstrating practical impact for data-efficient generative modeling in healthcare, finance, and environmental monitoring.

Abstract

Building on the remarkable achievements in generative sampling of natural images, we propose an innovative challenge, potentially overly ambitious, which involves generating samples of entire multivariate time series that resemble images. However, the statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects. This issue is especially problematic for deep generative models that follow the conventional approach of generating samples from a canonical distribution and then decoding or denoising them to match the true data distribution. In contrast, our method is grounded in information theory and aims to implicitly characterize the distribution of images, particularly the (global and local) dependency structure between pixels. We achieve this by empirically estimating its KL-divergence in the dual form with respect to the respective marginal distribution. This enables us to perform generative sampling directly in the optimized 1-D dual divergence space. Specifically, in the dual space, training samples representing the data distribution are embedded in the form of various clusters between two end points. In theory, any sample embedded between those two end points is in-distribution w.r.t. the data distribution. Our key idea for generating novel samples of images is to interpolate between the clusters via a walk as per gradients of the dual function w.r.t. the data dimensions. In addition to the data efficiency gained from direct sampling, we propose an algorithm that offers a significant reduction in sample complexity for estimating the divergence of the data distribution with respect to the marginal distribution. We provide strong theoretical guarantees along with an extensive empirical evaluation using many real-world datasets from diverse domains, establishing the superiority of our approach w.r.t. state-of-the-art deep learning methods.
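
For orientation, one widely used dual form of the KL-divergence is the Donsker-Varadhan representation, sketched below. The abstract does not say which dual form is used, so treating it as this one is an assumption; the symbols $T$, $T_\theta$, $x_i$, and $\tilde{x}_i$ are introduced here for illustration only. $P$ denotes the data distribution and $Q$ the marginal (product-of-marginals) distribution.

$$
D(P \| Q) \;=\; \sup_{T} \; \mathbb{E}_{x \sim P}\left[T(x)\right] \;-\; \log \mathbb{E}_{\tilde{x} \sim Q}\left[e^{T(\tilde{x})}\right]
$$

$$
\hat{D}_n(P \| Q) \;=\; \max_{\theta} \; \frac{1}{n}\sum_{i=1}^{n} T_\theta(x_i) \;-\; \log \frac{1}{n}\sum_{i=1}^{n} e^{T_\theta(\tilde{x}_i)}
$$

Under this reading, the scalar $T_\theta(x)$ is the coordinate of a sample in the 1-D dual space: maximizing the objective pushes data samples $x_i$ toward large values and marginal samples $\tilde{x}_i$ toward small ones.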

Paper Structure

This paper contains 17 sections, 2 theorems, 8 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem 1

This theorem gives the variance of directly estimating $D(P \| Q)$ in its dual form from $n$ samples, and contrasts it with the variance of estimating it via the dependency diffusion path, assuming the divergence estimates for each step are independent.

Figures (9)

  • Figure 1: An illustration of the general paradigm followed by most approaches in the literature on deep generative sampling. The data distribution (represented by red dots) gradually evolves into a simpler canonical distribution, such as a Gaussian distribution, either through an encoder or by adding noise. A canonical distribution facilitates the generation of novel samples, which are then mapped back into the data distribution via a decoder or via a denoising diffusion process. The intermediate distributions between the data distribution and the canonical distribution are implicit in some models and explicit in others. One common limitation of these approaches is that they require a large sample size, which is not available for our problem.
  • Figure 2: A high-level illustration of our approach for generative sampling. Our key idea is to estimate empirical divergence in its dual form between the observed data points (red dots) and the respective samples (blue dots), so as to implicitly characterize the data distribution of interest in the 1-D dual functional space. The top sketch shows, in the dual space, samples from the two distributions (red vs blue) that are pulled in opposite directions to attain the maximal estimate of the divergence as the optimal measure. The boundary of real samples in the dual space (implicitly) represents the data distribution. For a finer-grained representation of real samples in the dual space, we estimate divergence locally between the nearest neighboring sets. Since the dual space is one-dimensional, it is highly interpretable and straightforward to identify regions (holes) of missing data points. Our algorithm generates (missing) samples in those holes via a gradient walk between the respective clusters. For robustness in generative sampling, we estimate divergence of the observations w.r.t. the generated samples, locally as well as globally.
  • Figure 3: An illustration of our approach for localized divergence estimation at cut points for multi-scale clustering. As established in gargitc23, clusters with maximal (empirical) divergence w.r.t. each other are contiguous in the dual space separated by cut points. As such, the divergence between two clusters at a cut point is estimated by computing softmax and mean statistics using all the samples from the clusters respectively. We instead propose to estimate divergence locally at a cut point by computing the statistics only on the nearest neighbors of the cut point from either side, as shown above. By maximizing localized divergence between neighbors for a small number of cut points while minimizing it on average, we accomplish multi-scale clustering. Having optimized a fine-grained representation of real samples in the dual space, novel samples can be generated in empty space between the clusters (or between distant neighboring data points).
  • Figure 4: An illustration of sample-efficient divergence estimation w.r.t. marginals via dependency diffusion. For an input image, there is an underlying dependency structure between all the dimensions (pixels), which is unknown and never learned explicitly. From left to right, in each step, we choose a subset of columns in the image and replace their values with samples from the respective marginal densities. This, in essence, diffuses the dependencies of the chosen columns w.r.t. each other (as well as between pixels within each column) and w.r.t. all the other columns, as shown in the dependency graphs. In this manner, we obtain samples for all the intermediate densities as well as the marginal distribution without having to learn the density functions or the respective dependency graphs. For each pair of adjacent distributions on the path, $Q^{j-1}$ and $Q^j$, we obtain an empirical estimate of $D(Q^{j-1} \| Q^j)$ in its dual form, and the sum of all the divergence measures along the path gives us an estimate of the divergence of the data distribution w.r.t. the marginals, $D(P \| Q)$. The estimation via the path avoids the otherwise exponential sample complexity w.r.t. the true measure of $D(P \| Q)$ (in LB by song2019understanding).
  • Figure 5: All the methods are compared in terms of KL-Divergence of generated samples w.r.t. data distribution which is desired to be minimized while maximizing entropy of the generated samples.
  • ...and 4 more figures
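
The dependency diffusion path described in the Figure 4 caption can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dependency_diffusion_path`, the `column_groups` argument, and the choice to draw from each pixel's empirical marginal by shuffling that pixel's values across subjects are all hypothetical.

```python
import random


def dependency_diffusion_path(images, column_groups, seed=0):
    """Return sample sets [Q^0, Q^1, ..., Q^J], where Q^0 is the data itself.

    images: list of n images, each a list of rows (lists of floats).
    column_groups: list of column-index lists; group j is diffused at step j.
    Assumption: shuffling one pixel's values across the n subjects stands in
    for sampling that pixel from its empirical marginal.
    """
    rng = random.Random(seed)
    current = [[row[:] for row in img] for img in images]  # deep copy
    n, height = len(current), len(current[0])
    path = [[[row[:] for row in img] for img in current]]
    for cols in column_groups:
        for c in cols:
            for r in range(height):
                # Shuffle this single pixel independently of all others,
                # which breaks its dependencies but keeps its marginal.
                values = [current[i][r][c] for i in range(n)]
                rng.shuffle(values)
                for i in range(n):
                    current[i][r][c] = values[i]
        path.append([[row[:] for row in img] for img in current])
    return path


# Toy data: 6 "subjects", each a 3x4 image of random values.
data_rng = random.Random(1)
images = [[[data_rng.random() for _ in range(4)] for _ in range(3)]
          for _ in range(6)]
path = dependency_diffusion_path(images, column_groups=[[0, 1], [2, 3]])
```

Each adjacent pair `(path[j-1], path[j])` would then be passed to a dual-form divergence estimator, and the per-step estimates summed to approximate $D(P \| Q)$, as the caption describes. The estimator itself is left out here because the page does not specify its architecture.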

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2