Adaptive joint distribution learning

Damir Filipovic, Michael Multerer, Paul Schneider

TL;DR

The paper addresses estimating joint distributions from samples with the crucial constraints of normalization and positivity by introducing the joint distribution learner (JDL) in a tensor-product RKHS. It derives a representer theorem that reduces the optimization to a bilinear form $h|_{\mathcal G} = {\bm K}_Y{\bm H}{\bm K}_X$, and proposes an adaptive low-rank scheme based on pivoted Cholesky and a double-orthogonal basis to enable fast learning on datasets with millions of points. Positivity tightenings (pointwise and single-inequality) and elementary error bounds are developed to maintain valid probability structures while keeping computation tractable. Numerical experiments on conditional moments and binary classification show JDL and its polynomial variant JPDL outperform traditional CME and perform competitively with kernel logistic regression, with scalability to high dimensions and very large $n$. The approach thus provides a scalable, principled framework for learning joint and conditional distributions in complex, large-scale settings with rigorous structural guarantees.
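
The bilinear form $h|_{\mathcal G} = {\bm K}_Y{\bm H}{\bm K}_X$ can be read as a tensor-product expansion evaluated on the sample grid. The sketch below is an illustration, not the paper's implementation. It uses random positive semidefinite stand-ins for the kernel matrices and a random coefficient matrix (the names `K_Y`, `K_X`, `H`, `evaluate_on_grid` and the sizes are assumptions). It checks the standard identity $\operatorname{vec}({\bm K}_Y{\bm H}{\bm K}_X) = ({\bm K}_X^\top \otimes {\bm K}_Y)\operatorname{vec}({\bm H})$, which shows why the matrix form avoids building the Kronecker product explicitly.

```python
import numpy as np


def evaluate_on_grid(K_Y, H, K_X):
    """Evaluate the bilinear form K_Y H K_X (values of h on the sample grid)."""
    return K_Y @ H @ K_X


def vec(M):
    """Column-major vectorization, matching the usual vec(.) convention."""
    return M.reshape(-1, order="F")


# Hypothetical stand-ins: random PSD matrices play the role of kernel matrices.
rng = np.random.default_rng(0)
n_y, n_x = 4, 3
A = rng.standard_normal((n_y, n_y))
B = rng.standard_normal((n_x, n_x))
K_Y = A @ A.T
K_X = B @ B.T
H = rng.standard_normal((n_y, n_x))  # illustrative coefficient matrix

grid_values = evaluate_on_grid(K_Y, H, K_X)

# Same quantity via the Kronecker (tensor-product) form; this needs
# O((n_x * n_y)^2) memory, which the bilinear form avoids.
kron_values = np.kron(K_X.T, K_Y) @ vec(H)

print(np.allclose(vec(grid_values), kron_values))
```

For sample sizes in the millions, only the factored form is practical, which is presumably why the low-rank scheme operates on the kernel matrices separately.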

Abstract

We develop a new framework for estimating joint probability distributions using tensor product reproducing kernel Hilbert spaces (RKHS). Our framework accommodates a low-dimensional, normalized and positive model of a Radon--Nikodym derivative, which we estimate from sample sizes of up to several millions, alleviating the inherent limitations of RKHS modeling. Well-defined normalized and positive conditional distributions are natural by-products to our approach. Our proposal is fast to compute and accommodates learning problems ranging from prediction to classification. Our theoretical findings are supplemented by favorable numerical results.
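
To illustrate how a low-dimensional model can be extracted from a large kernel matrix, here is a minimal sketch of a textbook pivoted Cholesky factorization with a trace-based stopping rule. It is not the paper's low-rank algorithm, which additionally builds a double-orthogonal basis. The Gaussian kernel, the function names, and the tolerance are assumptions chosen for illustration. Only one kernel column is computed per step, so the full matrix is never formed inside the routine.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def pivoted_cholesky(x, kernel, tol=1e-3, max_rank=None):
    """Return L (n x m) and the pivot indices with trace(K - L L^T) <= tol."""
    n = x.shape[0]
    max_rank = n if max_rank is None else min(max_rank, n)
    # Diagonal of K, computed entry by entry (n kernel evaluations).
    diag = np.array([kernel(x[i:i + 1], x[i:i + 1])[0, 0] for i in range(n)])
    L = np.zeros((n, max_rank))
    pivots = []
    m = 0
    while m < max_rank and diag.sum() > tol:
        p = int(np.argmax(diag))          # largest remaining diagonal entry
        col = kernel(x, x[p:p + 1])[:, 0]  # column p of K
        col = col - L[:, :m] @ L[p, :m]    # subtract the current approximation
        L[:, m] = col / np.sqrt(diag[p])
        diag = np.maximum(diag - L[:, m] ** 2, 0.0)  # residual diagonal
        pivots.append(p)
        m += 1
    return L[:, :m], pivots


# Small usage example on synthetic data (sizes are illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
L, pivots = pivoted_cholesky(x, gaussian_kernel, tol=1e-3)
K = gaussian_kernel(x, x)  # formed here only to check the error
print(L.shape[1], np.trace(K - L @ L.T))
```

The residual $K - LL^\top$ stays positive semidefinite, so its trace bounds every entry of the error; a looser tolerance yields a smaller rank at the cost of accuracy.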

Paper Structure

This paper contains 20 sections, 7 theorems, 73 equations, 11 figures, 1 algorithm.

Key Result

Lemma 2.1

We have

Figures (11)

  • Figure 1: Second-moment squared loss. The $y$-axis of each panel shows loss function \ref{eq:secmomloss}, evaluated over $n_{\textnormal{test}}=5{,}000$ samples, for the joint distribution learner (JDL), the polynomial joint distribution learner (JPDL), and the conditional mean embedding (CME). The $x$-axis shows the number $n$ of data points used for training and validation. The data are generated from a mean-zero, unit-standard-deviation multivariate Gaussian distribution with covariance matrix sampled using the algorithm proposed by \cite{ilyahensen21}. The low-rank algorithm \ref{algo:bioChol} is applied to the kernel matrices with tolerance $\varepsilon=10^{-3}$ for $n\leq 10^5$, $\varepsilon=10^{-2}$ for $n=10^6$, and $\varepsilon=10^{-1}$ for $n=10^7$. JPDL uses $\varepsilon=0$.
  • Figure 2: Low-rank tolerance and data size. The $y$-axis of each panel shows loss function \ref{eq:secmomloss}, evaluated over $n_{\textnormal{test}}=5{,}000$ samples, for the joint distribution learner (JDL) and the conditional mean embedding (CME). The $x$-axis shows the number $n$ of data points used for training and validation. The data are generated from a mean-zero, unit-standard-deviation multivariate Gaussian distribution with covariance matrix sampled using the algorithm proposed by \cite{ilyahensen21}. Tolerances are indicated in the figures.
  • Figure 3: Normalization error. The $y$-axis of each panel shows the maximal normalization error over $n_{\textnormal{test}}=5{,}000$ samples for the unconstrained joint distribution learner (JDL), the polynomial joint distribution learner (JPDL), and the conditional mean embedding (CME). The $x$-axis shows the number $n$ of data points used for training and validation. The data are generated from a mean-zero, unit-standard-deviation multivariate Gaussian distribution with covariance matrix sampled using the algorithm proposed by \cite{ilyahensen21}. The low-rank algorithm \ref{algo:bioChol} is applied to the kernel matrices with tolerance $\varepsilon=10^{-3}$ for $n\leq 10^5$, $\varepsilon=10^{-2}$ for $n=10^6$, and $\varepsilon=10^{-1}$ for $n=10^7$. JPDL uses $\varepsilon=0$.
  • Figure 4: CME normalization error, low-rank and data size trade-off. The panel shows the maximal normalization error over $n_{\textnormal{test}}=5{,}000$ samples for the conditional mean embedding (CME). The $x$-axis shows the number $n$ of data points used for training and validation. The data are generated from a mean-zero, unit-standard-deviation multivariate Gaussian distribution for $d=3$, with covariance matrix sampled using the algorithm proposed by \cite{ilyahensen21}. Tolerances are indicated in the legend.
  • Figure 5: Positivity error. The $y$-axis of each panel shows the percentage of second-moment matrices, over $n_{\textnormal{test}}=5{,}000$ samples, that fail to be positive semidefinite for the joint distribution learner (JDL), the polynomial joint distribution learner (JPDL), and the conditional mean embedding (CME). The $x$-axis shows the number $n$ of data points used for training and validation. The data are generated from a mean-zero, unit-standard-deviation multivariate Gaussian distribution with covariance matrix sampled using the algorithm proposed by \cite{ilyahensen21}. The low-rank algorithm \ref{algo:bioChol} is applied to the kernel matrices with tolerance $\varepsilon=10^{-3}$ for $n\leq 10^5$, $\varepsilon=10^{-2}$ for $n=10^6$, and $\varepsilon=10^{-1}$ for $n=10^7$. JPDL uses $\varepsilon=0$.
  • ...and 6 more figures
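
The positivity error reported in Figure 5 is a percentage of estimated second-moment matrices that fail to be positive semidefinite. A minimal sketch of such a check follows. The function name `psd_failure_rate`, the numerical tolerance, and the toy matrices are assumptions; the paper's exact evaluation procedure is not reproduced here.

```python
import numpy as np


def psd_failure_rate(matrices, tol=1e-10):
    """Percentage of matrices in a stack whose smallest eigenvalue is below -tol."""
    mats = np.asarray(matrices, dtype=float)
    # Symmetrize to guard against small asymmetries from floating-point error.
    sym = 0.5 * (mats + np.swapaxes(mats, -1, -2))
    # eigvalsh returns eigenvalues in ascending order for each stacked matrix.
    min_eigs = np.linalg.eigvalsh(sym)[..., 0]
    return 100.0 * np.mean(min_eigs < -tol)


# Toy stack of four 2x2 "second-moment" estimates; two are indefinite.
toy = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
    [[2.0, 1.0], [1.0, 2.0]],
    [[1.0, 2.0], [2.0, 1.0]],
]
rate = psd_failure_rate(toy)
print(rate)
```

An estimator whose conditional weights can be negative may produce indefinite second-moment matrices, which is the failure mode this percentage counts.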

Theorems & Definitions (14)

  • Lemma 2.1
  • Lemma 2.2
  • Theorem 3.1
  • Theorem 4.1
  • Lemma 4.2: Non-negativity for bounded kernels
  • Lemma 4.3
  • Lemma 4.4
  • Proof of Lemma \ref{lem:intro}
  • Proof of Lemma \ref{lemdist}
  • Proof of Theorem \ref{thm:representer}
  • ...and 4 more