Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation
Julius Vetter, Guy Moss, Cornelius Schröder, Richard Gao, Jakob H. Macke
TL;DR
Sourcerer tackles the ill-posed problem of source distribution estimation: finding a source distribution $q(\theta)$ whose simulations, obtained by pushing $q$ through a simulator $p(x|\theta)$, reproduce the observed data distribution $p_o(x)$. Because many sources can satisfy this condition, it adopts the maximum-entropy principle to select a unique one. The framework is purely sample-based and differentiable: it maximizes the entropy $H(q)$ while enforcing data consistency through a distance $D(q^{\#}, p_o)$ between the pushforward $q^{\#}$ and the data, implemented as the regularized objective $\max_\phi \lambda H(q_\phi)-(1-\lambda)\log(D(q^{\#}_\phi, p_o))$ with a dynamic schedule for $\lambda$. The source $q_\phi$ is parameterized by a neural sampler, entropy is estimated with the Kozachenko-Leonenko estimator so that only samples are needed, and $D$ is the Sliced-Wasserstein distance; differentiable surrogates can stand in for non-differentiable simulators. Across benchmarks and high-dimensional tasks, Sourcerer recovers sources with substantially higher entropy than recent source estimation methods without sacrificing data fidelity, and it demonstrates practical utility by inferring Hodgkin-Huxley parameters from thousands of real neuron recordings. The approach offers a principled, scalable way to estimate parameter distributions of mechanistic scientific models while retaining uncertainty, and applies to likelihood-free, simulation-based settings with high-dimensional observations.
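To make the two sample-based ingredients of the objective concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the 2-Wasserstein variant of the sliced distance, equal sample sizes on both sides, and a brute-force nearest-neighbor search; the function names (`sliced_wasserstein`, `kl_entropy`, `regularized_loss`) and the parameters `n_proj` and `k` are hypothetical. The sketch only evaluates the loss; training a neural sampler $q_\phi$ would additionally need an autodiff framework.

```python
import math
import numpy as np


def sliced_wasserstein(x, y, n_proj=128, rng=None):
    """Monte Carlo sliced 2-Wasserstein distance between two equal-sized samples."""
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of identical shape (n, d)")
    rng = np.random.default_rng(rng)
    # Random unit directions on the sphere, one per projection.
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In 1D, optimal transport between equal-sized samples matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))


def _digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -np.euler_gamma + np.sum(1.0 / np.arange(1, n))


def kl_entropy(samples, k=1):
    """Kozachenko-Leonenko k-nearest-neighbor entropy estimate, in nats."""
    n, d = samples.shape
    # Brute-force pairwise Euclidean distances (O(n^2) memory; fine for a sketch).
    diff = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    r_k = np.sort(dist, axis=1)[:, k - 1]  # distance to the k-th neighbor
    # Log-volume of the d-dimensional unit ball.
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return float(
        _digamma_int(n) - _digamma_int(k) + log_unit_ball + d * np.mean(np.log(r_k))
    )


def regularized_loss(theta, x_sim, x_obs, lam, rng=None):
    """Negative of lam * H(q) - (1 - lam) * log D(q#, p_o), estimated from samples."""
    entropy = kl_entropy(theta)
    distance = sliced_wasserstein(x_sim, x_obs, rng=rng)
    return -(lam * entropy - (1.0 - lam) * math.log(distance))
```

Here `theta` would hold samples from the source, `x_sim` the corresponding simulator outputs, and `x_obs` an equally sized batch of observations. Lowering `lam` over the course of training shifts the weight from entropy toward data consistency, which is one way to read the dynamic schedule for $\lambda$.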
Abstract
Scientific modeling applications often require estimating a distribution of parameters consistent with a dataset of observations - an inference task also known as source distribution estimation. This problem can be ill-posed, however, since many different source distributions might produce the same distribution of data-consistent simulations. To make a principled choice among many equally valid sources, we propose an approach which targets the maximum entropy distribution, i.e., prioritizes retaining as much uncertainty as possible. Our method is purely sample-based - leveraging the Sliced-Wasserstein distance to measure the discrepancy between the dataset and simulations - and thus suitable for simulators with intractable likelihoods. We benchmark our method on several tasks, and show that it can recover source distributions with substantially higher entropy than recent source estimation methods, without sacrificing the fidelity of the simulations. Finally, to demonstrate the utility of our approach, we infer source distributions for parameters of the Hodgkin-Huxley model from experimental datasets with thousands of single-neuron measurements. In summary, we propose a principled method for inferring source distributions of scientific simulator parameters while retaining as much uncertainty as possible.
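Stated in symbols, with $q(\theta)$ the source, $p(x|\theta)$ the simulator, and $p_o(x)$ the data distribution, the task described above can be written as follows. This is a sketch of the idealized constrained problem; treating $H$ as differential entropy and writing the consistency requirement as an exact equality are assumptions made here for illustration.

$$
q^{\#}(x) = \int p(x|\theta)\, q(\theta)\, d\theta, \qquad
H(q) = -\int q(\theta) \log q(\theta)\, d\theta, \qquad
q^{*} = \arg\max_{q} H(q) \quad \text{subject to} \quad q^{\#} = p_o .
$$

The ill-posedness lies in the constraint: more than one $q$ can satisfy $q^{\#} = p_o$, and the entropy term is what singles out one of them.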
