Bounds on Lp errors in density ratio estimation via f-divergence loss functions

Yoshiaki Kitazawa

TL;DR

The paper asks how accurately density ratio estimation (DRE) recovers the true density ratio when the estimator is trained with variational $f$-divergence losses. It establishes upper and lower bounds on the $L_p$ error that hold for Lipschitz estimators and do not depend on the specific $f$-divergence, showing how the data dimensionality and the KL divergence between $Q$ and $P$ jointly govern estimation accuracy. A key finding is that for $p > 1$ the lower bound includes a term exponential in the KL divergence, so the estimation error can grow rapidly as $\mathrm{KL}(Q \| P)$ increases, and this effect is amplified by larger $p$. Numerical experiments support the predicted dependence on KL divergence and dimension. The analysis is framed through a $\mu$-representation of the $f$-divergence loss that connects nearest-neighbor geometry to density-ratio estimation. The work offers theoretical guidance for selecting $f$-divergence losses and assessing sample complexity in high-dimensional DRE tasks, with practical implications for domain adaptation, generative modeling, and information-estimation methods that rely on accurate density ratios.
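
For orientation, the variational $f$-divergence losses mentioned above are usually built from the following standard identity. This is a sketch in generic notation, assuming $f$ is convex with $f(1) = 0$ and $Q$ is absolutely continuous with respect to $P$; the paper's own Definitions 2.1 and 2.2 may use different symbols or sign conventions.

$$
D_f(Q \,\|\, P) \;=\; \mathbb{E}_{P}\!\left[ f\!\left( \frac{dQ}{dP} \right) \right]
\;=\; \sup_{\phi} \Big\{ \mathbb{E}_{Q}\big[\phi(X)\big] - \mathbb{E}_{P}\big[f^{*}(\phi(X))\big] \Big\},
\qquad
f^{*}(t) \;=\; \sup_{u} \big\{ u t - f(u) \big\}.
$$

When $f$ is differentiable, the supremum is attained at $\phi^{*} = f'(dQ/dP)$, so minimizing the negated objective over a model class yields an estimate of the density ratio $dQ/dP$ after inverting $f'$.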

Abstract

Density ratio estimation (DRE) is a core technique in machine learning used to capture relationships between two probability distributions. $f$-divergence loss functions, which are derived from variational representations of $f$-divergence, have become a standard choice in DRE for achieving cutting-edge performance. This study provides novel theoretical insights into DRE by deriving upper and lower bounds on the $L_p$ errors through $f$-divergence loss functions. These bounds apply to any estimator belonging to a class of Lipschitz continuous estimators, irrespective of the specific $f$-divergence loss function employed. The derived bounds are expressed as a product involving the data dimensionality and the expected value of the density ratio raised to the $p$-th power. Notably, the lower bound includes an exponential term that depends on the Kullback-Leibler (KL) divergence, revealing that the $L_p$ error increases significantly as the KL divergence grows when $p > 1$. This increase becomes even more pronounced as the value of $p$ grows. The theoretical insights are validated through numerical experiments.
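
To make the role of the $p$-th moment of the density ratio concrete, here is a small illustrative sketch. It is not taken from the paper: it assumes the toy setting $P = N(0, I_d)$ and $Q = N(m, I_d)$, for which $\mathrm{KL}(Q \| P) = \|m\|^2 / 2$ and $\mathbb{E}_P[r^p] = \exp\big(p (p - 1)\, \mathrm{KL}(Q \| P)\big)$ with $r = dQ/dP$. All function names are hypothetical. In this toy case the $L_p(P)$ norm of the ratio equals $\exp\big((p - 1)\, \mathrm{KL}(Q \| P)\big)$, which is constant for $p = 1$ and grows exponentially in the KL divergence for $p > 1$, faster for larger $p$.

```python
import math
import random


def kl_gaussian_shift(mean_shift):
    """KL(Q||P) for Q = N(m, I) and P = N(0, I), which equals ||m||^2 / 2."""
    return 0.5 * sum(m * m for m in mean_shift)


def ratio_moment_closed_form(mean_shift, p):
    """E_P[r(X)^p] with r = dQ/dP; equals exp(p (p - 1) KL) in this toy Gaussian case."""
    return math.exp(p * (p - 1) * kl_gaussian_shift(mean_shift))


def lp_norm_of_ratio(mean_shift, p):
    """(E_P[r^p])^(1/p), the L_p(P) norm of the true density ratio."""
    return ratio_moment_closed_form(mean_shift, p) ** (1.0 / p)


def ratio_moment_monte_carlo(mean_shift, p, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_P[r(X)^p] using samples X ~ P = N(0, I)."""
    rng = random.Random(seed)
    kl = kl_gaussian_shift(mean_shift)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in mean_shift]
        # log r(x) = <m, x> - ||m||^2 / 2 for the Gaussian mean-shift pair.
        log_r = sum(m * xi for m, xi in zip(mean_shift, x)) - kl
        total += math.exp(p * log_r)
    return total / n_samples


if __name__ == "__main__":
    # Sweep the KL divergence in one dimension and print the L_p norm of the ratio.
    for kl in (0.5, 1.0, 2.0, 4.0):
        shift = [math.sqrt(2.0 * kl)]
        norms = [lp_norm_of_ratio(shift, p) for p in (1, 2, 3)]
        print(kl, norms)
```

The Monte Carlo helper is only a sanity check on the closed form. Its variance itself grows exponentially with the KL divergence, so it is reliable only for small mean shifts.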

Paper Structure (41 sections, 21 theorems, 164 equations, 5 figures, 2 tables)

Introduction
Related Work.
Preliminaries: Notation, Setup, and f-Divergence Loss Functions
Notation, Preliminary Concepts, and Setup
DRE with f-divergence variational representation
Main Results
Theoretical Results.
Experimental Results.
Lp Errors vs. the KL-Divergence in Data
Lp Errors vs. the Dimensions of Data
Overview of Upper and Lower Bound Derivations
Conceptual reformulation of the f-divergence loss functions
Derivation of Upper and Lower Bounds for Optimal Functions of the μ-Representation f-Divergence Loss Functions
Derivation of Upper and Lower Bounds for Optimal Functions of the f-Divergence Loss Functions
Conclusions
...and 26 more sections

Key Result

Theorem 3.5

Assume $\Omega$ is a compact set in $\mathbb{R}^d$, where $d \ge 3$, and $f$ satisfies Assumption main_assumption_for_f. Let $P$ and $Q$ denote the probability measures on $\Omega$, and let $\phi$ represent a $K$-Lipschitz function that minimizes the $f$-divergence loss functions defined in Equation (Lower Bound) Assume Assumption main_assumption_lower: Equations (Eq_main_theorem_sample_requireme

Figures (5)

Figure 1: The experimental results of $L_p$ errors versus the magnitude of KL-divergence in the data are presented in Section \ref{subsection_ExperimentalResults}. The $x$-axis represents the magnitude of KL-divergence in synthetic datasets of fixed dimensionality. The $y$-axes of the left, center, and right graphs correspond to the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The plots depict the median values of the $y$-axis, while the error bars indicate the interquartile range (25th to 75th percentiles). The blue line represents errors computed using the $\alpha$-divergence loss function, whereas the orange line corresponds to errors computed using the KL-divergence loss function.
Figure 2: The experimental results on $L_p$ errors versus the dimensionality of the data are presented in Section \ref{subsection_ExperimentalResults}. The top row displays results using the $\alpha$-divergence loss function, whereas the bottom row presents results using the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples utilized in the optimizations of DRE. The $y$-axes of the left, center, and right graphs correspond to the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The plots show the median $y$-axis values, while the error bars represent the interquartile range (25th to 75th percentiles). The blue, orange, and green lines correspond to data dimensions of 50, 100, and 200, respectively.
Figure 3: The experimental results of $L_p$ errors versus the KL-divergence in the data for each multimodal case $M = 1, 2, 3$, and $4$ of the numerator datasets are presented, as discussed in Sections \ref{Section_MainResults} and \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The results for $M = 1$ were reported in Section \ref{Section_MainResults}. The $x$-axis represents the KL-divergence of synthetic datasets with fixed dimensions. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue line represents errors using the $\alpha$-divergence loss function, and the orange line represents errors using the KL-divergence loss function. The plots show the median $y$-axis values at each KL-divergence level in the synthetic datasets, and the error bars denote the interquartile range (25th to 75th percentiles) of the $y$-axis values.
Figure 4: The experimental results of $L_p$ errors versus the dimensionality of the data for the multimodal cases $M = 1$ and $2$ in the numerator datasets are presented, as discussed in Sections \ref{Section_MainResults} and \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The results for $M = 1$ were reported in Section \ref{Section_MainResults}. The top row shows the results using the $\alpha$-divergence loss function, while the bottom row shows the results using the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples used in the optimizations for DRE. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue, orange, and green lines represent the results for data dimensionalities of 50, 100, and 200, respectively. The plots show the median $y$-axis values at each sample size, and the error bars indicate the interquartile range (25th to 75th percentiles) of the $y$-axis values.
Figure 5: The experimental results of $L_p$ errors versus the dimensionality of the data for the multimodal cases $M = 3$ and $4$ in the numerator datasets are presented, as discussed in Section \ref{Apdx_section_TheDetailsOfNumericalExperiments}. The top row shows the results using the $\alpha$-divergence loss function, while the bottom row shows the results using the KL-divergence loss function. The $x$-axis represents the logarithm of the number of samples used in the optimizations for DRE. The $y$-axes of the left, center, and right graphs represent the $L_1$, $L_2$, and $L_3$ errors in DRE, respectively. The blue, orange, and green lines represent the results for data dimensionalities of 50, 100, and 200, respectively. The plots show the median $y$-axis values at each sample size, and the error bars indicate the interquartile range (25th to 75th percentiles) of the $y$-axis values.

Theorems & Definitions (42)

Definition 2.1: $f$-divergence
Definition 2.2: $f$-Divergence Loss
Theorem 3.5: Informal. See Theorems \ref{main_theorem_sample_requirement} and \ref{main_theorem_sample_requirement_2}
Definition 4.1: $\mu$-Representation $f$-Divergence Loss
Proposition 4.2
Theorem 4.3
Theorem 4.4
Theorem 4.5
Remark 4.6
Theorem 4.7
...and 32 more