Towards Healing the Blindness of Score Matching

Mingtian Zhang, Oscar Key, Peter Hayes, David Barber, Brooks Paige, François-Xavier Briol

TL;DR

The paper tackles the blindness of score-based divergences in multi-modal settings by introducing the Mixture Fisher Divergence (MFD), which bridges disconnected supports with a mixing density $m$ and mixing weight $\beta$. It proves MFD is a valid divergence on disconnected supports and demonstrates its practical benefits in density estimation by first training a bridged model on $\tilde{p}_d=\beta p_d+(1-\beta)m$ and then applying a correction step to recover the true density. In the energy-based-model setting, the authors propose a three-step pipeline that avoids explicit normalization during training, yielding substantial improvements in KL divergence over standard FD on mixtures of Gaussians and concentric circles. Overall, MFD provides a principled way to extend score-based divergences to multi-modal distributions, offering a practical path to more robust density estimation and related score-based tasks.
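
The bridge-then-correct idea can be sketched on a one-dimensional grid. This is a minimal illustration, not the authors' implementation: the two-mode data density, the Gaussian choice of $m$, the value of $\beta$, and the helper names `bridge` and `correct` are all assumptions, and the correction is assumed to be the pointwise inversion of $\tilde{p}_d=\beta p_d+(1-\beta)m$.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def bridge(p, m, beta):
    """Bridged density: beta * p + (1 - beta) * m."""
    return beta * p + (1.0 - beta) * m


def correct(p_tilde, m, beta):
    """Invert the bridge pointwise: (p_tilde - (1 - beta) * m) / beta."""
    return (p_tilde - (1.0 - beta) * m) / beta


x = np.linspace(-12.0, 12.0, 4801)

# Assumed toy data density: two well-separated modes with unequal weights.
p_d = 0.2 * gaussian_pdf(x, -5.0, 0.5) + 0.8 * gaussian_pdf(x, 5.0, 0.5)

# Assumed mixing density m (a wide Gaussian covering both modes) and weight beta.
m = gaussian_pdf(x, 0.0, 5.0)
beta = 0.5

p_tilde = bridge(p_d, m, beta)          # density the bridged model would be trained on
recovered = correct(p_tilde, m, beta)   # correction step applied to the exact p_tilde

max_abs_error = float(np.max(np.abs(recovered - p_d)))
print("max |recovered - p_d| =", max_abs_error)
```

Here the correction is applied to the exact bridged density, so it recovers $p_d$ up to floating-point error. With a learned model in place of `p_tilde`, the corrected values could become negative where the model underestimates $(1-\beta)m$, so some clipping and renormalization would likely be needed in practice.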

Abstract

Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.

Paper Structure

This paper contains 15 sections, 6 theorems, 23 equations, 5 figures.

Key Result

Theorem 1

Assume that two distributions (i) have differentiable densities $p$ and $q$ supported on a common open connected set $\mathcal{X}\subseteq {\mathbb{R}}^d$ and (ii) satisfy $s_p-s_q\in L^2(p)$. Then the FD is a valid divergence, i.e., ${\mathrm{FD}}(p||q)=0 \Leftrightarrow p=q$.
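
The theorem uses the score function $s_p$ and the FD without restating them. Under the standard convention, which is assumed here (the paper's constant factor may differ), they can be written as:

```latex
s_p(x) = \nabla_x \log p(x), \qquad
\mathrm{FD}(p||q) = \frac{1}{2}\int_{\mathcal{X}} p(x)\,
  \lVert s_p(x) - s_q(x) \rVert_2^2 \,\mathrm{d}x .
```

Condition (ii), $s_p-s_q\in L^2(p)$, is exactly the requirement that this integral be finite.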

Figures (5)

  • Figure 1: Densities (a) and score functions (b) of the distributions $p$ and $q$. Panel (c) shows ${\mathrm{FD}}(p||q)$ with $\alpha_p=0.2$ as $\alpha_q$ varies from $0.01$ to $0.09$ with a grid size of $0.01$.
  • Figure 2: Densities (a) and score functions (b) of $\tilde{p}$ and $\tilde{q}$. Panel (c) shows $\mathrm{MFD}(p||q)$ with $\alpha_p=0.2$ as $\alpha_q$ varies from $0.0$ to $1.0$ with a grid size of $0.01$. The star marks the minimum of the MFD, which is attained at $\alpha_q=\alpha_p=0.2$; the original FD is also plotted for comparison.
  • Figure 3: Density estimation comparisons with FD and MFD for the energy-based model. The $\mathrm{KL}(p_d||p_\theta)$ evaluations are 3.52/0.22 (b/e) for FD and 0.17/0.01 (c/f) for MFD; lower is better.
  • Figure 4: FD with training data noise annealing
  • Figure 5: Density Estimation with FD and MFD.
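
The kind of comparison described for Figures 1 and 2 can be reproduced numerically in spirit. The sketch below is illustrative only. It assumes that $p$ and $q$ are two-component Gaussian mixtures with mixing weights $\alpha_p$ and $\alpha_q$, that $\mathrm{MFD}(p||q)$ is the FD between the bridged densities $\tilde{p}=\beta p+(1-\beta)m$ and $\tilde{q}=\beta q+(1-\beta)m$, and that the FD carries the conventional factor $\frac{1}{2}$. The mode locations, the Gaussian $m$, $\beta=0.5$, and the $\alpha_q$ grid are arbitrary choices rather than the paper's settings. Because Gaussian mixtures have full support, this shows the practical near-blindness for well-separated modes rather than the strictly disconnected case.

```python
import numpy as np

MEAN, STD = 5.0, 1.0  # assumed mode location (+/- MEAN) and mode width


def gauss(x, mean, std):
    """Univariate Gaussian density on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def gauss_grad(x, mean, std):
    """Derivative of the Gaussian density with respect to x."""
    return -(x - mean) / std**2 * gauss(x, mean, std)


def two_mode(x, alpha):
    """Density and derivative of alpha * N(-MEAN, STD^2) + (1 - alpha) * N(MEAN, STD^2)."""
    density = alpha * gauss(x, -MEAN, STD) + (1.0 - alpha) * gauss(x, MEAN, STD)
    grad = alpha * gauss_grad(x, -MEAN, STD) + (1.0 - alpha) * gauss_grad(x, MEAN, STD)
    return density, grad


def fisher_divergence(x, p, dp, q, dq):
    """0.5 * integral of p * (s_p - s_q)^2, with scores dp/p and dq/q (Riemann sum)."""
    dx = x[1] - x[0]
    return 0.5 * float(np.sum(p * (dp / p - dq / q) ** 2) * dx)


def mixture_fisher_divergence(x, p, dp, q, dq, m, dm, beta):
    """FD between the bridged densities beta * p + (1 - beta) * m and beta * q + (1 - beta) * m."""
    return fisher_divergence(
        x,
        beta * p + (1.0 - beta) * m, beta * dp + (1.0 - beta) * dm,
        beta * q + (1.0 - beta) * m, beta * dq + (1.0 - beta) * dm,
    )


x = np.linspace(-12.0, 12.0, 4801)
m, dm = gauss(x, 0.0, 3.0), gauss_grad(x, 0.0, 3.0)  # assumed mixing density
beta = 0.5                                            # assumed mixing weight

p, dp = two_mode(x, 0.2)          # alpha_p = 0.2, as in the figure captions
q_far, dq_far = two_mode(x, 0.8)  # a clearly wrong alpha_q

fd_far = fisher_divergence(x, p, dp, q_far, dq_far)
mfd_far = mixture_fisher_divergence(x, p, dp, q_far, dq_far, m, dm, beta)

# Sweep alpha_q and locate the minimiser of the MFD.
alphas = np.linspace(0.01, 0.99, 99)
mfd_curve = [mixture_fisher_divergence(x, p, dp, *two_mode(x, a), m, dm, beta) for a in alphas]
best_alpha = float(alphas[int(np.argmin(mfd_curve))])

print("FD  at alpha_q = 0.8:", fd_far)
print("MFD at alpha_q = 0.8:", mfd_far)
print("alpha_q minimising the MFD:", best_alpha)
```

On this toy setup the FD should stay close to zero even though $\alpha_q$ is far from $\alpha_p$, because the scores of $p$ and $q$ differ only in the low-density region between the modes. The bridged version gives a clearly larger value and is minimised at $\alpha_q=\alpha_p$.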

Theorems & Definitions (8)

  • Theorem 1: FD on a connected set
  • Theorem 2: FD is ill-defined on disconnected sets
  • Theorem 3: Validity of the MFD
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Proposition 1: FD is ill-defined on disconnected sets