Towards Healing the Blindness of Score Matching

Mingtian Zhang; Oscar Key; Peter Hayes; David Barber; Brooks Paige; François-Xavier Briol

TL;DR

The paper tackles the blindness of score-based divergences in multi-modal settings by introducing the Mixture Fisher Divergence (MFD), which bridges disconnected supports with a mixing density $m$ and mixing weight $\beta$. It proves MFD is a valid divergence on disconnected supports and demonstrates its practical benefits in density estimation by first training a bridged model on $\tilde{p}_d=\beta p_d+(1-\beta)m$ and then applying a correction step to recover the true density. In the energy-based-model setting, the authors propose a three-step pipeline that avoids explicit normalization during training, yielding substantial improvements in KL divergence over standard FD on mixtures of Gaussians and concentric circles. Overall, MFD provides a principled way to extend score-based divergences to multi-modal distributions, offering a practical path to more robust density estimation and related score-based tasks.
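For reference, the quantities named in the summary can be written out as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the factor $\tfrac{1}{2}$ is one common normalization of the FD, the MFD is taken to be the FD between the two bridged densities (consistent with the $\tilde{p}$ and $\tilde{q}$ plotted in Figure 2), and the correction step is shown as the plain algebraic inversion of the mixture, which may differ in detail from the procedure used for energy-based models.

```latex
\begin{align}
\mathrm{FD}(p\|q) &= \tfrac{1}{2}\int p(x)\,\bigl\|s_p(x)-s_q(x)\bigr\|_2^2\,\mathrm{d}x,
  \qquad s_p(x)=\nabla_x \log p(x), \\
\tilde{p} &= \beta p + (1-\beta)\,m,
  \qquad \tilde{q} = \beta q + (1-\beta)\,m,
  \qquad 0<\beta<1, \\
\mathrm{MFD}(p\|q) &= \mathrm{FD}(\tilde{p}\,\|\,\tilde{q}), \\
p_d(x) &= \frac{\tilde{p}_d(x) - (1-\beta)\,m(x)}{\beta}.
\end{align}
```

The last line simply solves $\tilde{p}_d=\beta p_d+(1-\beta)m$ for $p_d$; it requires $m$ and $\beta$ to be known, which holds by construction because both are chosen by the user.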

Abstract

Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.

Paper Structure (15 sections, 6 theorems, 23 equations, 5 figures)

Introduction
Understanding the Blindness Problem
Healing the Blindness Problem with the Mixture Fisher Divergence
Density Estimation with Energy-based Models
Related Work
Derivations and Proofs
Derivation of Equation \ref{eq:disjoint}
Proof of Theorem 1 (FD on a connected set)
Proof of Theorem 2 (FD is ill-defined on disconnected sets)
Derivation of Score Matching
Kernelized Stein Discrepancy Extensions
Proof of Theorem 3 (Validity of the MFD)
Experiment Details
Data Noise Annealing Doesn't Help
Spread Fisher Divergence

Key Result

Theorem 1

Assume two distributions (i) have differentiable densities $p$ and $q$ supported on a common open connected set $\mathcal{X}\subseteq {\mathbb{R}}^d$ and (ii) satisfy $s_p-s_q\in L^2(p)$. Then the FD is a valid divergence, i.e., ${\mathrm{FD}}(p||q)=0 \Leftrightarrow p=q$.
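The practical consequence of the connectedness condition can be checked numerically. The sketch below is illustrative only: it assumes a one-dimensional mixture of two unit-variance Gaussians centred at $\pm 5$ (the means, the value $\alpha_q=0.5$, the integration grid, and all function names are assumptions, not taken from the paper) and uses the $\tfrac{1}{2}$-normalized FD. Because the modes are far apart, the supports are nearly disjoint, and the FD between two mixtures with very different weights comes out numerically negligible even though it is strictly positive in theory.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf_and_score(x, alpha, mu=5.0, std=1.0):
    """Density and score d/dx log p(x) of alpha*N(-mu, std^2) + (1-alpha)*N(mu, std^2)."""
    left = alpha * gaussian_pdf(x, -mu, std)
    right = (1.0 - alpha) * gaussian_pdf(x, mu, std)
    pdf = left + right
    # Derivative of the density; each Gaussian contributes pdf_k * (-(x - mean_k) / std^2).
    dpdf = left * (-(x + mu) / std**2) + right * (-(x - mu) / std**2)
    return pdf, dpdf / pdf

def fisher_divergence(alpha_p, alpha_q, grid):
    """0.5 * E_p[(s_p - s_q)^2], approximated by a Riemann sum on a uniform grid."""
    p, s_p = mixture_pdf_and_score(grid, alpha_p)
    _, s_q = mixture_pdf_and_score(grid, alpha_q)
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(p * (s_p - s_q) ** 2) * dx

grid = np.linspace(-15.0, 15.0, 30001)
fd_value = fisher_divergence(0.2, 0.5, grid)
print(f"FD(p||q) with alpha_p=0.2, alpha_q=0.5: {fd_value:.3e}")
```

The score of each mixture matches the score of the nearest component almost everywhere that $p$ has mass, so the mixing weights barely influence the integrand; this is the blindness the theorem's connectedness assumption rules out.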

Figures (5)

Figure 1: Panels (a) and (b) plot the densities and score functions of the distributions $p$ and $q$. Panel (c) shows ${\mathrm{FD}}(p||q)$ with $\alpha_p=0.2$ as $\alpha_q$ varies from $0.01$ to $0.09$ in steps of $0.01$.
Figure 2: Panels (a) and (b) plot the densities and score functions of $\tilde{p}$ and $\tilde{q}$. Panel (c) shows $\mathrm{MFD}(p||q)$ with $\alpha_p=0.2$ as $\alpha_q$ varies from $0.0$ to $1.0$ in steps of $0.01$. The star marks the minimum of the MFD, which is attained at $\alpha_q=\alpha_p=0.2$; the original FD is plotted for comparison.
Figure 3: Density estimation with FD and MFD for the energy-based model. The $\mathrm{KL}(p_d||p_\theta)$ values are 3.52/0.22 (panels b/e) for FD and 0.17/0.01 (panels c/f) for MFD; lower is better.
Figure 4: FD with training data noise annealing.
Figure 5: Density Estimation with FD and MFD.
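The sweep described in the Figure 2 caption can be reproduced in spirit with a few lines of NumPy. This is a minimal sketch under assumptions: the data mixture has unit-variance components at $\pm 5$, the mixing density $m$ is a zero-mean Gaussian with standard deviation 5, and $\beta=0.5$; none of these values, nor the function names, come from the paper. The MFD is computed as the $\tfrac{1}{2}$-normalized FD between the bridged densities $\tilde{p}=\beta p+(1-\beta)m$ and $\tilde{q}=\beta q+(1-\beta)m$.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bridged_pdf_and_score(x, alpha, beta=0.5, mu=5.0, bridge_std=5.0):
    """Density and score of beta*[alpha*N(-mu,1) + (1-alpha)*N(mu,1)] + (1-beta)*N(0, bridge_std^2)."""
    left = beta * alpha * gaussian_pdf(x, -mu, 1.0)
    right = beta * (1.0 - alpha) * gaussian_pdf(x, mu, 1.0)
    bridge = (1.0 - beta) * gaussian_pdf(x, 0.0, bridge_std)
    pdf = left + right + bridge
    # Derivative of the bridged density, summed component by component.
    dpdf = left * (-(x + mu)) + right * (-(x - mu)) + bridge * (-x / bridge_std**2)
    return pdf, dpdf / pdf

def mixture_fisher_divergence(alpha_p, alpha_q, grid):
    """FD between the two bridged densities, approximated by a Riemann sum."""
    p_tilde, s_p = bridged_pdf_and_score(grid, alpha_p)
    _, s_q = bridged_pdf_and_score(grid, alpha_q)
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(p_tilde * (s_p - s_q) ** 2) * dx

grid = np.linspace(-25.0, 25.0, 50001)
alphas = np.linspace(0.0, 1.0, 101)  # candidate alpha_q values, step 0.01
mfd_curve = np.array([mixture_fisher_divergence(0.2, a, grid) for a in alphas])
best_alpha = alphas[np.argmin(mfd_curve)]
print(f"argmin over alpha_q: {best_alpha:.2f}")
print(f"MFD at alpha_q=0.5: {mfd_curve[50]:.3e}")
```

Because $m$ places mass between and around the two modes, the bridged scores depend on the mixing weights over a region where $\tilde{p}$ is not negligible, so the curve has a clear minimum at $\alpha_q=\alpha_p$ instead of being flat.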

Theorems & Definitions (8)

Theorem 1: FD on a connected set
Theorem 2: FD is ill-defined on disconnected sets
Theorem 3: Validity of the MFD
Lemma 4
proof
Lemma 5
proof
Proposition 1: FD is ill-defined on disconnected sets
