Learning from Conditional Distributions via Dual Embeddings

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, Le Song

TL;DR

The paper tackles learning from conditional distributions $p(z|x)$ when only a few samples, or even a single sample, are available from each conditional distribution. Using Fenchel duality, it recasts the problem as a saddle-point problem that depends only on the joint distribution $p(x,z,y)$. It introduces Embedding-SGD, a kernel-based algorithm that jointly optimizes a primal function and a dual function within RKHSs, updates with one sample at a time, and has a theoretical sample complexity of $O(1/\epsilon^2)$. The framework unifies learning with invariance and policy evaluation in reinforcement learning, and extends to stochastic-process predictions such as hitting times. Experiments on invariance learning and policy evaluation show better performance and robustness than existing algorithms. Extensions mentioned include random features and neural-dual structures.

Abstract

Many machine learning tasks, such as learning with invariance and policy evaluation in reinforcement learning, can be characterized as problems of learning from conditional distributions. In such problems, each sample $x$ itself is associated with a conditional distribution $p(z|x)$ represented by samples $\{z_i\}_{i=1}^M$, and the goal is to learn a function $f$ that links these conditional distributions to target values $y$. These learning problems become very challenging when we have only limited samples or, in the extreme case, only one sample from each conditional distribution. Commonly used approaches either assume that $z$ is independent of $x$, or require an overwhelmingly large number of samples from each conditional distribution. To address these challenges, we propose a novel approach which employs a new min-max reformulation of the problem of learning from conditional distributions. With this reformulation, we only need to deal with the joint distribution $p(z,x)$. We also design an efficient learning algorithm, Embedding-SGD, and establish theoretical sample complexity for such problems. Finally, our numerical experiments on both synthetic and real-world datasets show that the proposed approach can significantly improve over the existing algorithms.
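As a hedged illustration of the min-max reformulation mentioned in the abstract, the derivation below works through the square loss. This special case is an assumption chosen for concreteness, and the symbols $L$, $v$, and $u$ are introduced here for illustration. It also assumes that $z$ is drawn from $p(z|x)$ independently of $y$ given $x$.

```latex
\begin{align*}
\min_f\; L(f) &= \mathbb{E}_{x,y}\!\left[\tfrac12\left(\mathbb{E}_{z|x}[f(z)] - y\right)^2\right], \\
\tfrac12\,(v-y)^2 &= \max_{u\in\mathbb{R}} \left\{ u\,(v - y) - \tfrac12 u^2 \right\}, \\
\min_f\; L(f) &= \min_f \max_{u(\cdot,\cdot)}\;
  \mathbb{E}_{x,y,z}\!\left[ u(x,y)\,\bigl(f(z) - y\bigr) \right]
  - \tfrac12\,\mathbb{E}_{x,y}\!\left[ u(x,y)^2 \right].
\end{align*}
```

The second line is the Fenchel dual representation of the square loss, with the maximum attained at $u = v - y$. The third line substitutes $v = \mathbb{E}_{z|x}[f(z)]$ and moves the maximization inside the expectation, so the inner maximizer is $u^*(x,y) = \mathbb{E}_{z|x}[f(z)] - y$. Because the conditional expectation no longer sits inside a nonlinearity, a single sample $z$ per $x$ is enough to form unbiased stochastic gradients of the saddle objective.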

Paper Structure

This paper contains 47 sections, 5 theorems, 67 equations, 3 figures, 7 algorithms.

Key Result

Lemma 1

Let $\xi$ be a random variable on $\Xi$ and assume for any $\xi\in \Xi$, function $g(\cdot,\xi):\mathbb{R}\to(-\infty,+\infty)$ is a proper (we say $g(\cdot, \xi)$ is proper when $\{u\in \mathbb{R}: g(u, \xi)<\infty\}$ is non-empty and $g(u, \xi)>-\infty$ for all $u$) and upper semicontinuous [...] where $\mathcal{G}(\Xi)=\{u(\cdot):\Xi\to\mathbb{R}\}$ is the entire space of functions defined on [...]
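The lemma is labeled the interchangeability principle, which in its standard form says that expectation and maximization can be swapped once the maximizer is allowed to depend on $\xi$. The following is a minimal numerical sketch of that idea, not code from the paper. The three-point distribution, the grid of candidate values, and the function `g` are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

# Hypothetical toy setup (not from the paper): xi takes three values.
xi_values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
u_grid = np.arange(-1.0, 4.0)  # candidate values of u: -1, 0, 1, 2, 3


def g(u, xi):
    """Concave in u for every fixed xi; maximized at u = xi."""
    return xi * u - 0.5 * u ** 2


# Left side: maximize over u separately for each xi, then take the expectation.
pointwise_max = np.array([g(u_grid, xi).max() for xi in xi_values])
lhs = float(probs @ pointwise_max)

# Right side: maximize the expectation over all functions u: xi -> u_grid.
rhs = max(
    float(probs @ g(np.array(u_of_xi), xi_values))
    for u_of_xi in itertools.product(u_grid, repeat=len(xi_values))
)

# Restricting u to a constant (no dependence on xi) can only lose value.
best_constant = max(float(probs @ g(u, xi_values)) for u in u_grid)

print(lhs, rhs, best_constant)
```

In this sketch the two sides agree because the best function simply picks the pointwise maximizer for each value of $\xi$, while the best constant $u$ gives a strictly smaller value. This is why the dual variable in the reformulation is a function rather than a scalar.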

Figures (3)

  • Figure 1: Toy example with $f^*$ sampled from a Gaussian process. The $y$ at position $x$ is obtained by smoothing $f^*$ with a Gaussian distribution conditioned on location $x$, i.e., $y = \mathbb{E}_{z|x}\left[f^*(z)\right]$ where $z\sim p(z|x) = \mathcal{N}\left(x, 0.3\right)$. Given samples $\{x, y\}$, the task is to recover $f^*(\cdot)$. The blue dashed curve is the ground-truth $f^*(\cdot)$. The cyan curve is the observed noisy $y$. The red curve is the recovered signal $f(\cdot)$ and the green curve denotes the dual function $u(\cdot, y)$ with the observed $y$ plugged in for each corresponding position $x$. Indeed, the dual function $u(\cdot, y)$ emphasizes the difference between $y$ and $\mathbb{E}_{z|x}\left[f(z)\right]$ at every $x$. The interaction between primal $f(\cdot)$ and dual $u(\cdot, y)$ results in the recovery of the denoised signal.
  • Figure 2: Learning with invariance.
  • Figure 3: Policy evaluation.
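The primal-dual interaction described in the Figure 1 caption can be sketched as a stochastic gradient descent-ascent loop. This is a minimal sketch in the spirit of Embedding-SGD, not the paper's algorithm. It assumes the square loss, so the saddle objective is $u(x,y)\,(f(z)-y) - \tfrac12 u(x,y)^2$, and it swaps the RKHS functional updates for fixed random Fourier features. The names `saddle_sgd_step` and `rff`, the step-size schedule, the use of $\sin$ as a stand-in for $f^*$, and the reading of 0.3 as a standard deviation are all hypothetical choices.

```python
import numpy as np


def saddle_sgd_step(theta, alpha, phi_z, psi_xy, y, step):
    """One stochastic primal-dual update for the square-loss saddle objective
    u(x, y) * (f(z) - y) - 0.5 * u(x, y)**2, with the linear parameterizations
    f(z) = theta @ phi_z and u(x, y) = alpha @ psi_xy (assumed here).
    """
    f_val = theta @ phi_z
    u_val = alpha @ psi_xy
    # Primal: gradient descent on theta (the gradient is u * phi(z)).
    new_theta = theta - step * u_val * phi_z
    # Dual: gradient ascent on alpha (the gradient is (f - y - u) * psi(x, y)).
    new_alpha = alpha + step * (f_val - y - u_val) * psi_xy
    return new_theta, new_alpha


def rff(v, W, b):
    """Random Fourier features for an RBF kernel (an assumed feature map)."""
    return np.sqrt(2.0 / len(b)) * np.cos(W @ np.atleast_1d(v) + b)


rng = np.random.default_rng(0)
D = 50                                   # number of random features (assumed)
W_f, b_f = rng.normal(size=(D, 1)), rng.uniform(0, 2 * np.pi, D)
W_u, b_u = rng.normal(size=(D, 2)), rng.uniform(0, 2 * np.pi, D)
theta, alpha = np.zeros(D), np.zeros(D)
sigma = 0.3                              # assumed standard deviation of p(z|x)

for t in range(1, 2001):
    x = rng.uniform(-3.0, 3.0)
    z = rng.normal(x, sigma)             # a single sample z ~ p(z|x)
    # Stand-in target: for f*(z) = sin(z), E[f*(z) | x] = exp(-sigma^2/2) sin(x).
    y = np.exp(-0.5 * sigma ** 2) * np.sin(x)
    theta, alpha = saddle_sgd_step(
        theta, alpha,
        rff(z, W_f, b_f), rff(np.array([x, y]), W_u, b_u),
        y, step=0.2 / np.sqrt(t),
    )
```

Each iteration uses exactly one draw of $z$ for the sampled $x$, which is the setting the reformulation is designed for. The dual parameters track the residual $\mathbb{E}_{z|x}[f(z)] - y$, and the primal parameters move in the direction that shrinks it.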

Theorems & Definitions (5)

  • Lemma 1: interchangeability principle
  • Proposition 1
  • Theorem 1
  • Proposition 2
  • Lemma 2