Bounds on Lp errors in density ratio estimation via f-divergence loss functions

Yoshiaki Kitazawa

TL;DR

The paper studies how accurately density ratio estimation (DRE) recovers the true ratio when the estimator is trained with variational $f$-divergence losses. It establishes upper and lower bounds on the $L_p$ error that hold for any Lipschitz continuous estimator, regardless of the specific $f$-divergence, and these bounds show how data dimensionality and the KL divergence between $Q$ and $P$ jointly govern estimation accuracy. A key finding is that for $p > 1$ the lower bound contains a term that is exponential in the KL divergence, so the estimation error can grow rapidly as $\mathrm{KL}(Q \| P)$ increases, and the effect strengthens with larger $p$. Numerical experiments support the predicted dependence on KL divergence and dimension. The analysis is framed through a $\mu$-representation of the $f$-divergence loss that connects nearest-neighbor geometry to density ratio estimation. The work offers theoretical guidance for selecting $f$-divergence losses and assessing sample complexity in high-dimensional DRE tasks, with practical implications for domain adaptation, generative modeling, and information-estimation methods that rely on accurate density ratios.

Abstract

Density ratio estimation (DRE) is a core technique in machine learning used to capture relationships between two probability distributions. $f$-divergence loss functions, which are derived from variational representations of $f$-divergence, have become a standard choice in DRE for achieving cutting-edge performance. This study provides novel theoretical insights into DRE by deriving upper and lower bounds on the $L_p$ errors through $f$-divergence loss functions. These bounds apply to any estimator belonging to a class of Lipschitz continuous estimators, irrespective of the specific $f$-divergence loss function employed. The derived bounds are expressed as a product involving the data dimensionality and the expected value of the density ratio raised to the $p$-th power. Notably, the lower bound includes an exponential term that depends on the Kullback--Leibler (KL) divergence, revealing that the $L_p$ error increases significantly as the KL divergence grows when $p > 1$. This increase becomes even more pronounced as the value of $p$ grows. The theoretical insights are validated through numerical experiments.
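To make the role of "the expected value of the density ratio raised to the $p$-th power" concrete, the following is a minimal sketch under assumptions that are not taken from the paper: $P = N(0, I_d)$, $Q = N(m, I_d)$, and $r = dQ/dP$. In this toy case $\mathrm{KL}(Q \| P) = \|m\|^2 / 2$ and $E_P[r^p] = \exp(p(p-1)\,\mathrm{KL}(Q \| P))$ in closed form. More generally, Jensen's inequality gives $E_P[r^p] = E_Q[r^{p-1}] \ge \exp((p-1)\,\mathrm{KL}(Q \| P))$ for $p \ge 1$. This is a standard inequality, not the paper's bound, but it shows why an exponential dependence on the KL divergence is natural for $p > 1$. The function names are hypothetical.

```python
import numpy as np

def moment_closed_form(p, kl):
    """E_P[r^p] for P = N(0, I), Q = N(m, I), r = dQ/dP (toy assumption).

    Here KL(Q||P) = |m|^2 / 2, and a Gaussian integral gives
    E_P[r^p] = exp(p * (p - 1) * KL).
    """
    return np.exp(p * (p - 1) * kl)

def jensen_lower_bound(p, kl):
    """Generic bound E_P[r^p] = E_Q[r^(p-1)] >= exp((p - 1) * KL) for p >= 1."""
    return np.exp((p - 1) * kl)

def moment_monte_carlo(p, kl, d=5, n=200_000, seed=0):
    """Monte Carlo estimate of E_P[r^p] in the same toy Gaussian setting."""
    rng = np.random.default_rng(seed)
    m = np.zeros(d)
    m[0] = np.sqrt(2.0 * kl)          # mean shift chosen so that |m|^2 / 2 = kl
    x = rng.standard_normal((n, d))   # samples from P = N(0, I)
    log_r = x @ m - 0.5 * (m @ m)     # log of dQ/dP evaluated at x
    return np.mean(np.exp(p * log_r))
```

Calling `moment_closed_form(p, kl)` for `p` in `(1, 2, 3)` over a grid of `kl` values shows a moment that is constant for $p = 1$ and grows exponentially in the KL divergence for $p > 1$, with a faster rate for larger $p$. The Monte Carlo estimate is heavy-tailed and becomes unreliable once the KL divergence is large, which is itself a symptom of the same growth.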

Paper Structure

This paper contains 41 sections, 21 theorems, 164 equations, 5 figures, 2 tables.

Key Result

Theorem 3.5

Assume $\Omega$ is a compact set in $\mathbb{R}^d$, where $d \ge 3$, and $f$ satisfies Assumption main_assumption_for_f. Let $P$ and $Q$ denote probability measures on $\Omega$, and let $\phi$ be a $K$-Lipschitz function that minimizes the $f$-divergence loss functions defined in Equation (Lower Bound) Assume Assumption main_assumption_lower: Equations (Eq_main_theorem_sample_requireme

Figures (5)

  • Figure 1: $L_p$ errors versus the magnitude of the KL divergence in the data, from the experiments in Section \ref{subsection_ExperimentalResults}. The $x$-axis represents the magnitude of the KL divergence in synthetic datasets of fixed dimensionality. The $y$-axes of the left, center, and right graphs correspond to the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The plots show the median values, and the error bars indicate the interquartile range (25th to 75th percentiles). The blue line represents errors obtained with the $\alpha$-divergence loss function, and the orange line represents errors obtained with the KL-divergence loss function.
  • Figure 2: $L_p$ errors versus the dimensionality of the data, from the experiments in Section \ref{subsection_ExperimentalResults}. The top row shows results for the $\alpha$-divergence loss function, and the bottom row shows results for the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples used in the optimization for DRE. The $y$-axes of the left, center, and right graphs correspond to the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The plots show the median values, and the error bars indicate the interquartile range (25th to 75th percentiles). The blue, orange, and green lines correspond to data dimensions of 50, 100, and 200, respectively.
  • Figure 3: $L_p$ errors versus the KL divergence in the data for each multimodal case $M = 1, 2, 3, 4$ of the numerator datasets, as discussed in Sections \ref{Section_MainResults} and \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The results for $M = 1$ were reported in Section \ref{Section_MainResults}. The $x$-axis represents the KL divergence of synthetic datasets with fixed dimensions. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue line represents errors obtained with the $\alpha$-divergence loss function, and the orange line represents errors obtained with the KL-divergence loss function. The plots show the median values at each KL-divergence level, and the error bars denote the interquartile range (25th to 75th percentiles).
  • Figure 4: $L_p$ errors versus the dimensionality of the data for the multimodal cases $M = 1$ and $2$ of the numerator datasets, as discussed in Sections \ref{Section_MainResults} and \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The results for $M = 1$ were reported in Section \ref{Section_MainResults}. The top row shows results for the $\alpha$-divergence loss function, and the bottom row shows results for the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples used in the optimization for DRE. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue, orange, and green lines represent data dimensionalities of 50, 100, and 200, respectively. The plots show the median values at each sample size, and the error bars indicate the interquartile range (25th to 75th percentiles).
  • Figure 5: $L_p$ errors versus the dimensionality of the data for the multimodal cases $M = 3$ and $4$ of the numerator datasets, as discussed in Section \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The top row shows results for the $\alpha$-divergence loss function, and the bottom row shows results for the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples used in the optimization for DRE. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue, orange, and green lines represent data dimensionalities of 50, 100, and 200, respectively. The plots show the median values at each sample size, and the error bars indicate the interquartile range (25th to 75th percentiles).
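The captions above summarize each setting by the median and interquartile range of $L_p$ errors over repeated runs. A minimal sketch of how such a summary could be computed is given below. It assumes the $L_p$ error is the empirical $(\mathrm{mean}\,|\hat r - r|^p)^{1/p}$ over a set of evaluation points; the paper's exact definition (reference measure, normalization) is not given in this summary, and the function names are hypothetical.

```python
import numpy as np

def lp_error(r_true, r_hat, p):
    """Empirical L_p error between true and estimated density ratios.

    Assumed form: (mean over evaluation points of |r_hat - r_true|^p)^(1/p).
    """
    r_true = np.asarray(r_true, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    return float(np.mean(np.abs(r_hat - r_true) ** p) ** (1.0 / p))

def median_and_iqr(errors):
    """Return (median, 25th percentile, 75th percentile) of per-trial errors."""
    q25, q50, q75 = np.percentile(errors, [25, 50, 75])
    return float(q50), float(q25), float(q75)
```

In use, one would call `lp_error` once per trial for each $p \in \{1, 2, 3\}$, collect the per-trial values, and pass them to `median_and_iqr` to obtain the plotted point and its error bar.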

Theorems & Definitions (42)

  • Definition 2.1: $f$-divergence
  • Definition 2.2: $f$-Divergence Loss
  • Theorem 3.5: Informal. See Theorems \ref{main_theorem_sample_requirement} and \ref{main_theorem_sample_requirement_2}
  • Definition 4.1: $\mu$-Representation $f$-Divergence Loss
  • Proposition 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Remark 4.6
  • Theorem 4.7
  • ...and 32 more