Alpha-divergence loss function for neural density ratio estimation
Yoshiaki Kitazawa
TL;DR
The paper tackles density ratio estimation (DRE) with neural networks trained on loss functions derived from variational representations of $f$-divergences. It identifies four core optimization issues (train-loss hacking, biased mini-batch gradients, vanishing gradients, and the high sample requirements of KL-divergence losses) and introduces a novel $\alpha$-divergence loss ($\alpha$-Div), derived via a Gibbs-density reformulation, that yields unbiased, bounded gradients. Theoretical results show when and why $0<\alpha<1$ mitigates the major problems and provide a formal sample-size bound for $\alpha$-Div. Empirical results demonstrate stable optimization and unbiased gradients, but estimation accuracy on data with high KL-divergence is governed mainly by the amount of KL-divergence in the data rather than by the choice of $\alpha$. Overall, $\alpha$-Div offers a principled, robust alternative for DRE with neural networks, while highlighting intrinsic limits tied to the KL-divergence of the data.
Abstract
Density ratio estimation (DRE) is a fundamental machine learning technique for capturing relationships between two probability distributions. State-of-the-art DRE methods estimate the density ratio using neural networks trained with loss functions derived from variational representations of $f$-divergences. However, existing methods face optimization challenges, such as overfitting due to lower-unbounded loss functions, biased mini-batch gradients, vanishing training loss gradients, and high sample requirements for Kullback--Leibler (KL) divergence loss functions. To address these issues, we focus on $\alpha$-divergence, which provides a suitable variational representation of $f$-divergence. Subsequently, a novel loss function for DRE, the $\alpha$-divergence loss function ($\alpha$-Div), is derived. $\alpha$-Div is concise but offers stable and effective optimization for DRE. The boundedness of $\alpha$-divergence provides the potential for successful DRE with data exhibiting high KL-divergence. Our numerical experiments demonstrate the effectiveness of $\alpha$-Div in optimization. However, the experiments also show that the proposed loss function offers no significant advantage over the KL-divergence loss function in terms of RMSE for DRE. This indicates that the accuracy of DRE is primarily determined by the amount of KL-divergence in the data and is less dependent on $\alpha$-divergence.
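To make the properties named in the abstract concrete, the following is a minimal sketch of a loss of the general kind described, not the paper's exact definition. It assumes the form $\frac{1}{\alpha}\hat{E}_Q[e^{\alpha T}] + \frac{1}{1-\alpha}\hat{E}_P[e^{(\alpha-1)T}]$ with $0<\alpha<1$, where $T$ stands in for the network output. The function name `alpha_div_loss`, the sign convention, and the Gaussian toy data are all assumptions made for illustration. Under this assumed form both terms are positive, so the loss is bounded below, and each term is a plain sample mean, so a mini-batch gradient is an unbiased estimate of the full gradient. Pointwise minimization gives $e^{T} = p/q$.

```python
import numpy as np


def alpha_div_loss(t_p, t_q, alpha=0.5):
    """Hypothetical alpha-divergence-style loss (assumed form, not taken from the paper).

    t_p: values T(x) evaluated on samples x ~ P
    t_q: values T(x) evaluated on samples x ~ Q
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    t_p = np.asarray(t_p, dtype=float)
    t_q = np.asarray(t_q, dtype=float)
    # Both terms are means of positive quantities, so the loss cannot go below 0.
    term_q = np.mean(np.exp(alpha * t_q)) / alpha
    term_p = np.mean(np.exp((alpha - 1.0) * t_p)) / (1.0 - alpha)
    return term_q + term_p


def t_star(x):
    # log p(x)/q(x) for the toy pair P = N(0, 1), Q = N(1, 1).
    return 0.5 - x


# Toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=20_000)
x_q = rng.normal(1.0, 1.0, size=20_000)

# Loss at the true log-ratio versus loss at the uninformative choice T = 0.
loss_star = alpha_div_loss(t_star(x_p), t_star(x_q))
loss_zero = alpha_div_loss(np.zeros_like(x_p), np.zeros_like(x_q))
print(loss_star, loss_zero)
```

In this toy setting the loss at the true log-ratio comes out lower than at $T = 0$, which is what one would expect if the minimizer recovers the density ratio. In practice $T$ would be a neural network and the same expression would be minimized with a stochastic optimizer.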
