Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

TL;DR

The paper tackles learning conditional distributions $\pi^*(\cdot|x)$ in semi-supervised domain translation by recasting likelihood maximization within an inverse entropic OT (IOT) framework. It derives a loss that jointly leverages paired and unpaired data, establishes its equivalence to the inverse OT objective, and enables end-to-end learning via an energy-based, Gaussian-mixture parametrization. The authors prove a universal approximation property, showing the method can approximate the true conditional plan under mild conditions, and demonstrate empirical benefits on synthetic and real-world tasks, highlighting improved conditional density learning with limited labels. The approach unifies OT theory with probabilistic modeling, offering practical semi-supervised translation with potential extensions to more expressive neural parameterizations and high-dimensional settings. Overall, it provides a principled, likelihood-based route to recover $\pi^*(\cdot|x)$ in semi-supervised regimes, leveraging unpaired data through a tractable OT-inspired objective.

Abstract

Learning conditional distributions $\pi^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim \pi^*$. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semi-supervised models that utilize both limited paired data and additional unpaired i.i.d. samples $x \sim \pi^*_x$ and $y \sim \pi^*_y$ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an end-to-end learning algorithm to get $\pi^*(\cdot|x)$. In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Furthermore, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.
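
To make the idea of combining paired and unpaired data in one likelihood concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes an energy-based conditional model of the form $\pi^\theta(y|x) \propto \exp\big((f^\theta(y) - c^\theta(x,y))/\epsilon\big)$, restricts $y$ to a finite grid so that the normalizer $Z^\theta(x)$ is an exact sum, and uses hypothetical quadratic choices for `cost` and `potential`. Under these assumptions the negative log-likelihood splits into a term over pairs, a term over $x$ alone, and a term over $y$ alone, so the last two can be estimated from unpaired samples.

```python
import numpy as np

EPS = 1.0                            # assumed entropic regularization strength
Y_GRID = np.linspace(-3.0, 3.0, 61)  # assumed finite support for y (toy setting)

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def cost(x, y, theta):
    # Hypothetical cost c_theta(x, y): squared distance of y to a linear map of x.
    return 0.5 * (y - theta[0] * x) ** 2

def potential(y, theta):
    # Hypothetical potential f_theta(y): a simple quadratic.
    return -0.5 * theta[1] * y ** 2

def log_Z(x, theta):
    # log Z_theta(x) = log sum_y exp((f(y) - c(x, y)) / EPS), summed over Y_GRID.
    x = np.asarray(x)[:, None]
    y = Y_GRID[None, :]
    return logsumexp((potential(y, theta) - cost(x, y, theta)) / EPS, axis=1)

def semi_supervised_loss(xy_paired, x_unpaired, y_unpaired, theta):
    # Pairs feed the cost term; unpaired x feed log Z; unpaired y feed the potential.
    xp, yp = xy_paired
    joint_term = np.mean(cost(xp, yp, theta)) / EPS
    x_term = np.mean(log_Z(x_unpaired, theta))
    y_term = np.mean(potential(y_unpaired, theta)) / EPS
    return joint_term + x_term - y_term

def paired_nll(xy_paired, theta):
    # Plain supervised negative log-likelihood of the same model, pairs only.
    xp, yp = xy_paired
    log_p = (potential(yp, theta) - cost(xp, yp, theta)) / EPS - log_Z(xp, theta)
    return -np.mean(log_p)

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.3])
xp = rng.normal(size=8)               # paired sources
yp = rng.choice(Y_GRID, size=8)       # paired targets (on the grid)
x_extra = rng.normal(size=100)        # additional unpaired sources
y_extra = rng.choice(Y_GRID, size=100)  # additional unpaired targets

# With the marginal terms estimated from the pairs themselves, the split loss
# coincides with the supervised negative log-likelihood.
loss_same = semi_supervised_loss((xp, yp), xp, yp, theta)
nll = paired_nll((xp, yp), theta)

# With extra unpaired samples, only the two marginal terms change.
loss_more = semi_supervised_loss(
    (xp, yp), np.concatenate([xp, x_extra]), np.concatenate([yp, y_extra]), theta
)
```

The grid and the quadratic forms are only there to keep the sketch exact and short. In the paper's setting $y$ is continuous and the normalizer is made tractable by the Gaussian-mixture parametrization instead.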

Paper Structure (38 sections, 8 theorems, 70 equations, 6 figures, 3 tables, 1 algorithm)

Key Result

Proposition 3.1

Our parametrization of the cost function \eqref{eq:cost} and dual potential \eqref{eq:dual_potential} delivers $Z^\theta(x) \overset{\text{def}}{=} \sum_{m=1}^M\sum_{n=1}^N z_{mn}(x)$, where

Figures (6)

  • Figure 1: Visualization of domain translation setups. Red and green colors indicate the paired training data $XY_{\text{paired}}$, while grey indicates the unpaired training data $X_{\text{unpaired}}$, $Y_{\text{unpaired}}$.
  • Figure 2: Learned mapping on the $\textit{Gaussian}\to\textit{Swiss Roll}$ task for $P=128$ and $Q=R=1024$.
  • Figure 3: Performance of our Algorithm 1 in the $\textit{Gaussian}\to\textit{Swiss Roll}$ mapping task (\ref{subsec:swiss_roll}). We use MLPs to parametrize both the potential function $f^\theta$ and the cost function $c^\theta$.
  • Figure 4: Performance of our Algorithm 1 on the colored MNIST mapping task. Each pair consists of digits $2$ and $3$ with a hue shift of $120^\circ$. The first row shows the source images, the second row displays target images with ground-truth colors, the third row presents the mapping results for $10$ pairs in the train data, and the fourth row shows results for $200$ pairs.
  • Figure 5: Comparison of the mappings learned by baselines on the $\textit{Gaussian}\to\textit{Swiss Roll}$ task (\ref{subsec:swiss_roll}). We use $P=16K$ paired data and $Q=R=16K$ unpaired data for training.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Proposition 3.1: Tractable normalization constant
  • Proposition 3.2: Tractable conditional distributions
  • Theorem 3.3: Proposed parametrization guarantees universal conditional distributions
  • Proposition A.1: Gradient of our main loss \eqref{eq:loss}
  • proof : Proof of Proposition 3.1
  • proof : Proof of Proposition 3.2
  • proof : Proof of Proposition A.1
  • Lemma E.1: The Space $({\mathcal{P}}_1^+({\mathbb{R}}^D),d_{TV})$ is quantizable by Gaussian Mixtures
  • proof
  • Lemma E.2: The space $({\mathcal{P}}_1^+({\mathbb{R}}^D),d_{TV})$ is Approximate Simplicial
  • ...and 8 more