Alpha-divergence loss function for neural density ratio estimation

Yoshiaki Kitazawa

TL;DR

The paper tackles density ratio estimation (DRE) with neural networks by exploiting variational representations of $f$-divergences. It identifies four core optimization issues (train-loss hacking, biased mini-batch gradients, vanishing gradients, and the high sample-size requirements of KL-divergence losses) and introduces a novel $\alpha$-divergence loss ($\alpha$-Div), derived via a Gibbs-density reformulation, that yields unbiased, bounded gradients. Theoretical results show when and why choosing $0<\alpha<1$ mitigates these problems and provide a formal sample-size bound for $\alpha$-Div. Empirical results demonstrate stable optimization and unbiased gradients; on data with high KL-divergence, however, accuracy is dictated mainly by the magnitude of the KL-divergence rather than by the specific choice of $\alpha$. Overall, $\alpha$-Div offers a principled, robust alternative for DRE with neural networks, while highlighting intrinsic limits tied to the KL-divergence of the data.

Abstract

Density ratio estimation (DRE) is a fundamental machine learning technique for capturing relationships between two probability distributions. State-of-the-art DRE methods estimate the density ratio using neural networks trained with loss functions derived from variational representations of $f$-divergences. However, existing methods face optimization challenges, such as overfitting due to lower-unbounded loss functions, biased mini-batch gradients, vanishing training loss gradients, and high sample requirements for Kullback-Leibler (KL) divergence loss functions. To address these issues, we focus on $\alpha$-divergence, which provides a suitable variational representation of $f$-divergence. Subsequently, a novel loss function for DRE, the $\alpha$-divergence loss function ($\alpha$-Div), is derived. $\alpha$-Div is concise but offers stable and effective optimization for DRE. The boundedness of $\alpha$-divergence provides the potential for successful DRE with data exhibiting high KL-divergence. Our numerical experiments demonstrate the effectiveness of $\alpha$-Div in optimization. However, the experiments also show that the proposed loss function offers no significant advantage over the KL-divergence loss function in terms of RMSE for DRE. This indicates that the accuracy of DRE is primarily determined by the amount of KL-divergence in the data and is less dependent on $\alpha$-divergence.
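
A minimal sketch of how a loss of this kind could be evaluated on mini-batches. It assumes the density ratio is parameterized in a Gibbs-type form $q/p = e^{-T}$ and that the loss is $\frac{1}{\alpha}\hat{E}_Q[e^{\alpha T}] + \frac{1}{1-\alpha}\hat{E}_P[e^{(\alpha-1)T}]$; neither formula is quoted in this summary, so both should be read as assumptions. The Gaussian toy data, the function names, and $\alpha = 0.5$ are hypothetical.

```python
import math
import random

def alpha_div_loss(t_q, t_p, alpha):
    """Assumed empirical loss: energies T evaluated on samples from Q (t_q) and P (t_p)."""
    term_q = sum(math.exp(alpha * t) for t in t_q) / len(t_q)          # mean of exp(alpha * T) over Q
    term_p = sum(math.exp((alpha - 1.0) * t) for t in t_p) / len(t_p)  # mean of exp((alpha - 1) * T) over P
    return term_q / alpha + term_p / (1.0 - alpha)

def true_energy(x):
    # For P = N(0, 1) and Q = N(1, 1): q(x)/p(x) = exp(x - 0.5), so T(x) = -log(q/p) = 0.5 - x.
    return 0.5 - x

rng = random.Random(0)
xs_p = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # hypothetical samples from P
xs_q = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # hypothetical samples from Q

alpha = 0.5
# Loss at the true energy, and at an energy shifted by a constant offset of 1.
loss_true = alpha_div_loss([true_energy(x) for x in xs_q],
                           [true_energy(x) for x in xs_p], alpha)
loss_shifted = alpha_div_loss([true_energy(x) + 1.0 for x in xs_q],
                              [true_energy(x) + 1.0 for x in xs_p], alpha)
print(loss_true, loss_shifted)
```

Under the assumed form, both terms are positive whenever $0<\alpha<1$, so the loss is bounded below by zero. In this toy setup the true energy scores lower than the shifted one.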

Paper Structure (64 sections, 12 theorems, 90 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Problem Setup
DRE via f-divergence variational representations and its major problems
  DRE via f-divergence variational representation
  Train-loss hacking problem
  Biased gradient problem
  Vanishing gradient problem
  Sample size requirement problem for KL-divergence
DRE using a neural network with an α-divergence loss
  Derivation of our loss function for DRE
  Training and predicting with α-Div
Theoretical results for the proposed loss function
  Addressing the train-loss hacking problem
  Unbiasedness of gradients
  Addressing gradient vanishing problem
...and 49 more sections

Key Result

Theorem 4.1

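A variational representation of $\alpha$-divergence is given as
$$D_{\alpha}(Q\|P) = \sup_{\phi \ge 0}\left\{\frac{1}{\alpha(1-\alpha)} - \frac{1}{\alpha}E_Q\left[\phi^{-\alpha}\right] - \frac{1}{1-\alpha}E_P\left[\phi^{1-\alpha}\right]\right\},$$
where the supremum is taken over all measurable functions satisfying $E_P[\phi^{1 - \alpha}] < \infty$ and $E_Q[\phi^{-\alpha}] < \infty$. The maximum value is achieved at $\phi(\mathbf{x})=q(\mathbf{x})/p(\mathbf{x})$.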
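
As a quick sanity check of the claim that the maximum is attained at $\phi = q/p$, the sketch below evaluates a candidate objective on a small discrete example. The objective $\frac{1}{\alpha(1-\alpha)} - \frac{1}{\alpha}E_Q[\phi^{-\alpha}] - \frac{1}{1-\alpha}E_P[\phi^{1-\alpha}]$ is an assumption, chosen to be consistent with the integrability conditions and the stated maximizer. The toy distributions `p` and `q`, the function names, and $\alpha = 0.5$ are illustrative only.

```python
import math
import random

def alpha_objective(phi, p, q, alpha):
    """Assumed variational objective on a finite support for a candidate ratio phi."""
    e_q = sum(qi * f ** (-alpha) for qi, f in zip(q, phi))       # E_Q[phi^(-alpha)]
    e_p = sum(pi * f ** (1.0 - alpha) for pi, f in zip(p, phi))  # E_P[phi^(1-alpha)]
    return 1.0 / (alpha * (1.0 - alpha)) - e_q / alpha - e_p / (1.0 - alpha)

def alpha_divergence(p, q, alpha):
    """Closed form (1 - sum p^alpha q^(1-alpha)) / (alpha (1 - alpha))."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))

# Hypothetical toy distributions on three points.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
alpha = 0.5

true_ratio = [qi / pi for pi, qi in zip(p, q)]
best = alpha_objective(true_ratio, p, q, alpha)
closed_form = alpha_divergence(p, q, alpha)

# Random multiplicative perturbations of the true ratio should never score higher.
rng = random.Random(0)
perturbed = [
    alpha_objective([f * math.exp(rng.gauss(0.0, 0.3)) for f in true_ratio], p, q, alpha)
    for _ in range(100)
]
print(best, closed_form, max(perturbed))
```

On this example the objective at $q/p$ matches the closed-form value, and every perturbed candidate scores no higher, as the theorem leads one to expect for $0<\alpha<1$.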

Figures (6)

Figure 1: Results from the section "Experiments on the stability in optimization for different values of $\alpha$". The left ($\alpha = -2.0$), center ($\alpha = 3.0$), and right ($\alpha = 0.5$) graphs show training losses ($y$-axis) over learning steps ($x$-axis) during optimization using $\alpha$-Div with different $\alpha$ values. Solid blue lines represent median training losses, dark blue shaded areas show the 45th to 55th percentiles, and light blue shaded areas represent the 2.5th to 97.5th percentiles.
Figure 2: Results from the section "Experiments on improvement of optimization efficiency by removing gradient bias". The top row shows training losses, and the bottom row shows estimated density ratios (DR) during optimization. The left column uses the standard $\alpha$-divergence loss function (biased gradients), the center column uses $\alpha$-Div (unbiased gradients), and the right column uses nnBD-LSIF (unbiased gradients). The $x$-axis represents learning steps. Solid blue lines indicate median values, dark blue shaded areas show the 45th to 55th percentiles, and light blue shaded areas represent the 2.5th to 97.5th percentiles.
Figure 3: Results from the section "Experiments on the estimation accuracy using high KL-divergence data". The $x$-axis represents the ground-truth KL-divergence of the data. The $y$-axes of the left and right graphs represent the RMSE and the estimated KL-divergence, respectively. Each point shows the median $y$-axis value at the corresponding ground-truth KL-divergence. Vertical lines indicate the interquartile range (25th to 75th percentiles) of the $y$-axis values.
Figure 4: All results from the section "Experiments on the stability in optimization for different values of $\alpha$", for $\alpha=-3, -2, -1, 0.2, 0.5, 0.8, 2.0, 3.0$, and $4.0$. Each graph displays the training losses ($y$-axis) against the learning steps ($x$-axis) during optimization using $\alpha$-Div for different values of $\alpha$. The solid blue line represents the median training losses. The dark blue area indicates the range between the 45th and 55th percentiles, while the light blue area shows the range between the 2.5th and 97.5th percentiles of the training losses.
Figure 5: Results from the appendix section "Additional experiments: Experiments using real-world data" for LogisticRegression. In the figure titles, domain names at the origin of the arrows indicate the source domains, while those at the tip represent the target domains. The $x$-axis shows the number of features, and the $y$-axis represents the ROC AUC for the domain adaptation tasks. The orange line (SO) denotes models trained on source data only, without importance weighting, whereas the blue line (IW) represents models trained on source data with importance weighting.
...and 1 more figure

Theorems & Definitions (22)

Definition 3.1: $f$-divergence
Theorem 4.1
Theorem 4.2
Definition 4.3: $\alpha$-Div
Theorem 4.4
Theorem 5.1: Informal statement
Theorem 5.2
Definition C.1: $\alpha$-Divergence loss
Theorem C.2
Proof: proof of an appendix theorem (unresolved reference Apdx_theorem_alpha_div_resp)
...and 12 more
