Binary Losses for Density Ratio Estimation

Werner Zellinger

TL;DR

This work studies how the binary loss function used for density-ratio estimation shapes the resulting estimation error, measured by a Bregman divergence $B_\phi$. It gives a complete characterization: any strictly proper composite loss compatible with a prescribed $B_\phi$ must have a specific form, which yields a recipe for constructing convex losses that emphasize large density-ratio values. The author introduces novel loss families, including exponential-weight and polynomial-weight classes, that better prioritize large values of the density ratio $\beta$ and improve performance in deep domain adaptation, as evidenced by experiments over 484 real-world tasks and 9174 trained networks. The results show practical impact for importance weighting and parameter selection in domain adaptation, while leaving open questions about theoretical sample complexity.
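
For reference, a Bregman-divergence error of this kind is commonly written as follows. This is a sketch that assumes the standard form from the density-ratio literature, with a convex, twice differentiable generator $\phi$ and the expectation taken under the denominator measure $Q$; the paper's exact definition may differ in normalization.

$$B_\phi(\beta,\widehat{\beta}) = \int \Big[\phi(\beta(x)) - \phi(\widehat{\beta}(x)) - \phi'(\widehat{\beta}(x))\,\big(\beta(x)-\widehat{\beta}(x)\big)\Big]\,\mathrm{d}Q(x)$$

By Taylor's theorem with integral remainder, the integrand equals $\int_{\widehat{\beta}(x)}^{\beta(x)} (\beta(x)-c)\,\phi''(c)\,\mathrm{d}c$, so $\phi''$ acts as a weight on errors at the ratio value $c$. For example, $\phi(c)=c^2/2$ has constant weight $\phi''(c)=1$ and gives the squared error $\frac{1}{2}\int(\beta-\widehat{\beta})^2\,\mathrm{d}Q$, whereas an increasing $\phi''$ penalizes errors at large ratio values more heavily.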

Abstract

Estimating the ratio of two probability densities from a finite number of observations is a central machine learning problem. A common approach is to construct estimators using binary classifiers that distinguish observations from the two densities. However, the accuracy of these estimators depends on the choice of the binary loss function, raising the question of which loss function to choose based on desired error properties. For example, traditional loss functions, such as logistic or boosting loss, prioritize accurate estimation of small density ratio values over large ones, even though the latter are more critical in many applications. In this work, we start with prescribed error measures in a class of Bregman divergences and characterize all loss functions that result in density ratio estimators with small error. Our characterization extends results on composite binary losses from (Reid & Williamson, 2010) and their connection to density ratio estimation as identified by (Menon & Ong, 2016). As a result, we obtain a simple recipe for constructing loss functions with certain properties, such as those that prioritize an accurate estimation of large density ratio values. Our novel loss functions outperform related approaches for resolving parameter choice issues of 11 deep domain adaptation algorithms in average performance across 484 real-world tasks including sensor signals, texts, and images.
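
The classifier-based approach described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or its novel losses: it uses the plain logistic loss with a linear model in one dimension, and the names `fit_logistic_1d` and `density_ratio_lr` as well as the Gaussian toy data are hypothetical choices made for the example. A classifier is trained to separate samples of $P$ (label 1) from samples of $Q$ (label 0); its odds, corrected for the class proportions, estimate $\beta = \mathrm{d}P/\mathrm{d}Q$.

```python
import math
import random


def fit_logistic_1d(xs, ys, n_iter=25):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by Newton's method; ys in {0, 1}."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = p - y              # gradient contribution of the logistic loss
            w = p * (1.0 - p)      # Hessian weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: subtract H^{-1} g, using the closed-form 2x2 inverse.
        a -= (h11 * g0 - h01 * g1) / det
        b -= (h00 * g1 - h01 * g0) / det
    return a, b


def density_ratio_lr(xp, xq):
    """Return an estimator of dP/dQ from samples xp ~ P and xq ~ Q."""
    m, n = len(xp), len(xq)
    xs = list(xp) + list(xq)
    ys = [1.0] * m + [0.0] * n
    a, b = fit_logistic_1d(xs, ys)

    def beta_hat(x):
        # Classifier odds eta/(1-eta) = exp(a + b*x) estimate (m/n) * dP/dQ,
        # so rescale by n/m to remove the class-proportion factor.
        return (n / m) * math.exp(a + b * x)

    return beta_hat


if __name__ == "__main__":
    # Toy data (assumption): P = N(1, 1), Q = N(0, 1), so dP/dQ(x) = exp(x - 0.5).
    random.seed(0)
    xp = [random.gauss(1.0, 1.0) for _ in range(4000)]
    xq = [random.gauss(0.0, 1.0) for _ in range(4000)]
    beta_hat = density_ratio_lr(xp, xq)
    for x in (-1.0, 0.0, 1.0, 2.0):
        print(x, beta_hat(x), math.exp(x - 0.5))
```

In this toy setting the log-ratio is linear in $x$, so the linear logistic model is well specified. Swapping the logistic loss for another strictly proper composite loss changes which ratio values are estimated most accurately, which is the design question the paper addresses.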

Paper Structure

This paper contains 24 sections, 10 theorems, 100 equations, 3 figures, 34 tables, and 1 algorithm.

Key Result

Lemma 1

Let $\ell:\{-1,1\}\times\mathbb{R}\to \mathbb{R}$ be a strictly proper composite loss function with invertible link function $\Psi:[0,1]\to\mathbb{R}$ and twice differentiable negative Bayes risk $-\mathop{\mathrm{\underline{L}}}\nolimits:[0,1]\to\mathbb{R}$. Then with

Figures (3)

  • Figure 1: Left: Two piecewise constant probability measures $P$ and $Q$ on $[-1,1]$. Right: Estimators $\widehat{\beta}(x):= k x^2 + d$ of the density ratio $\beta=\frac{\mathrm{d} P}{\mathrm{d} Q}$ with $k,d>0$ computed by Algorithm 1 (for $m+n\to\infty$). Compared to LR (Bickel et al., 2009) and KuLSIF (Kanamori et al., 2009), our estimators originate from error measures $B_\phi$ in Eq. \ref{eq:Bregman_divergence_introduction} with increasing weight functions $\phi''(c)=c,c^6,e^{2c}$ and consequently obtain better estimates for large values $\beta(x), x\in[-1,-0.9]\cup[0.9,1]$; see Section \ref{sec:large_density_ratio_estimation}.
  • Figure 2: Estimation of the density ratio (left, black) in a Gaussian RKHS with Tikhonov penalty weighted by $\alpha$ for sample sizes $m+n=10$ (top right) and $m+n=100$ (bottom right). Our loss function (blue) prioritizes accurate estimation of larger values and consequently produces flatter curves than KuLSIF (green).
  • Figure 3: Using density ratio estimators from Figure 1 as sample weights for polynomial kernel least squares regression. Left: Bayes predictor (black, dashed) and observations ($\times$) from $Q$. Middle: regressors (blue: uniform weighting $\widehat{\beta}\equiv 1$, green: exact $\beta$, red: our estimate with $\phi''(c)=e^{2c}$ from Figure 1, yellow: LR estimate from Figure 1). Right: Pointwise errors for target observations from $P$. The exponential weight (ours) is higher for larger density ratio values and consequently achieves smaller errors on $[0.9,1]$.
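
The importance-weighting step described in the Figure 3 caption can be sketched as a weighted least squares fit. This is a simplified illustration, not the paper's experiment: it fits a straight line rather than a polynomial kernel regressor, the helper `weighted_line_fit` is a hypothetical name, and the weights $e^{2x}$ are an arbitrary stand-in for a density-ratio estimate that grows toward $x=1$.

```python
import math


def weighted_line_fit(xs, ys, ws):
    """Minimize sum_i w_i * (y_i - (a + b*x_i))**2 over intercept a and slope b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Source inputs on a grid over [-1, 1] with target function y = x^2 (assumption).
xs = [i / 50.0 - 1.0 for i in range(101)]
ys = [x * x for x in xs]

uniform = [1.0] * len(xs)                    # uniform weighting, beta_hat = 1
weights = [math.exp(2.0 * x) for x in xs]    # hypothetical ratio estimate, large near x = 1

a_u, b_u = weighted_line_fit(xs, ys, uniform)
a_w, b_w = weighted_line_fit(xs, ys, weights)

# Compare the pointwise error at x = 1, where the stand-in ratio is largest.
err_uniform = abs(a_u + b_u * 1.0 - 1.0)
err_weighted = abs(a_w + b_w * 1.0 - 1.0)
```

Because the misspecified line cannot fit $x^2$ everywhere, the weights decide where the fit is accurate: the weighted regressor trades error on the left of the interval for a smaller error near $x=1$, which mirrors the behavior the caption reports on $[0.9,1]$.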

Theorems & Definitions (24)

  • Example 1
  • Lemma 1: Menon & Ong (2016)
  • Theorem 1
  • Remark 1: on Novelty
  • Remark 2
  • Corollary 1
  • Remark 3
  • Lemma 2: Reid & Williamson (2009)
  • Proof of Lemma \ref{lemma:weight_representation}
  • Theorem 2: Shuford et al. (1966)
  • ...and 14 more