Learning Energy-Based Models by Self-normalising the Likelihood
Hugo Senetaire, Paul Jeha, Pierre-Alexandre Mattei, Jes Frellsen
TL;DR
The paper tackles the challenge of training energy-based models with intractable normalisation constants by introducing the self-normalised log-likelihood (SNL). SNL adds a single learnable parameter $b$ so that maximising $\ell_{\mathrm{SNL}}(\theta,b)$ yields the maximum-likelihood solution and recovers $\log Z_{\theta}$, while enabling unbiased gradient estimates via a proposal distribution. The authors prove that SNL is concave in the model parameters for exponential families and offer an information-theoretic perspective linking SNL to a generalised KL divergence. The framework extends to regression via an input-dependent $b_{\phi}(x)$ and to VAEs through SNELBO. Empirically, SNL-based EBMs outperform traditional methods on density estimation and regression tasks, and SNELBO enables VAEs with latent-EBM priors to achieve improved objective scores, while keeping training simple and stable.
Abstract
Training an energy-based model (EBM) with maximum likelihood is challenging due to the intractable normalisation constant. Traditional methods rely on expensive Markov chain Monte Carlo (MCMC) sampling to estimate the gradient of the logarithm of the normalisation constant. We propose a novel objective called self-normalised log-likelihood (SNL) that, compared to the regular log-likelihood, introduces a single additional learnable parameter representing the normalisation constant. SNL is a lower bound of the log-likelihood, and its optimum corresponds to both the maximum likelihood estimate of the model parameters and the normalisation constant. We show that the SNL objective is concave in the model parameters for exponential family distributions. Unlike the regular log-likelihood, the SNL can be directly optimised using stochastic gradient techniques by sampling from a crude proposal distribution. We validate the effectiveness of our proposed method on various density estimation tasks as well as EBMs for regression. Our results show that the proposed method, while simpler to implement and tune, outperforms existing techniques.
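
To make the lower-bound claim concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not spell out the objective, so the form used here is an assumption: applying $\log z \le z - 1$ to $z = Z_{\theta} e^{-b}$ gives $\ell_{\mathrm{SNL}}(\theta,b) = \frac{1}{n}\sum_i -E_{\theta}(x_i) - b - e^{-b} Z_{\theta} + 1$, which is at most the log-likelihood and matches it at $b = \log Z_{\theta}$. The toy energy $E_{\theta}(x) = (x-\mu)^2/2$, the wide Gaussian proposal, the sample sizes, and the names `energy` and `snl` are all hypothetical choices made for illustration.

```python
import numpy as np

def energy(x, mu):
    """Toy energy of an unnormalised unit-variance Gaussian (illustrative choice)."""
    return 0.5 * (x - mu) ** 2

def snl(x, mu, b, x_q, log_q):
    """Monte Carlo estimate of the assumed SNL objective.

    x     : data samples
    x_q   : samples drawn from the proposal q
    log_q : log-density of the proposal evaluated at x_q
    """
    # Importance-sampling estimate of Z = integral of exp(-E(x)) dx.
    z_hat = np.mean(np.exp(-energy(x_q, mu) - log_q))
    return np.mean(-energy(x, mu)) - b - np.exp(-b) * z_hat + 1.0

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=2000)            # synthetic data with mean 1

sigma_q = 3.0                                   # crude, wide Gaussian proposal
x_q = rng.normal(0.0, sigma_q, size=20000)
log_q = -0.5 * (x_q / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2.0 * np.pi))

mu = 1.0
true_log_z = 0.5 * np.log(2.0 * np.pi)          # log Z of the toy model

# Scan b on a grid: the objective should peak near log Z.
bs = np.linspace(0.0, 2.0, 201)
values = np.array([snl(x, mu, b, x_q, log_q) for b in bs])
b_star = bs[np.argmax(values)]

print("argmax over b:", b_star, " true log Z:", true_log_z)
```

Because the estimate of $Z_{\theta}$ enters the objective linearly rather than inside a logarithm, averaging over proposal samples gives an unbiased estimate of the objective and of its gradient, which is what allows plain stochastic gradient ascent on $(\theta, b)$ jointly. The grid scan over $b$ is used here only to visualise the optimum; in practice $b$ would be updated by gradient steps alongside $\theta$.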
