Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation

Julius Vetter, Guy Moss, Cornelius Schröder, Richard Gao, Jakob H. Macke

TL;DR

Sourcerer tackles the ill-posed problem of source distribution estimation by adopting a maximum-entropy principle to select a unique source distribution $q(\theta)$ that reproduces the observed data distribution $p_o(x)$ when pushed through a simulator $p(x|\theta)$. It introduces a purely sample-based, differentiable framework that maximizes $H(q)$ while enforcing data-consistency via a distance $D(q^{\#}, p_o)$, implemented with a regularized objective $\max_\phi \lambda H(q_\phi)-(1-\lambda)\log(D(q^{\#}_\phi, p_o))$ and a dynamic schedule for $\lambda$. The method relies on neural samplers for $q_\phi$, and uses the Kozachenko-Leonenko entropy estimator to remain fully sample-based; the distance metric is the Sliced-Wasserstein distance, with differentiable surrogates available for non-differentiable simulators. Across benchmarks and high-dimensional tasks, Sourcerer achieves substantially higher entropy in the estimated sources without sacrificing data fidelity, and demonstrates practical utility in inferring Hodgkin-Huxley parameters from thousands of real neuron recordings. This approach provides a principled, scalable means to quantify uncertainty and priors in mechanistic scientific models, with broad applicability to likelihood-free simulation-based inference and high-dimensional observation spaces.
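
The regularized objective above can be illustrated with a small, non-authoritative sketch. It uses only the Python standard library and is not the authors' implementation: the helper names (`digamma_int`, `kl_entropy`, `regularized_objective`), the choice `k=3`, the toy uniform samples, and the placeholder distance value `0.1` are all assumptions made for illustration. A practical version would use a tensor library with automatic differentiation so that the objective can be maximized over the sampler parameters $\phi$.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko k-nearest-neighbour estimate of differential entropy (nats)."""
    n, d = len(samples), len(samples[0])
    # Log-volume of the d-dimensional Euclidean unit ball.
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_radius_sum = 0.0
    for i, x in enumerate(samples):
        dists = sorted(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        log_radius_sum += math.log(dists[k - 1])  # distance to the k-th neighbour
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_radius_sum / n


def regularized_objective(entropy, distance, lam):
    """Objective from the text, to be maximized: lam * H - (1 - lam) * log(D)."""
    return lam * entropy - (1.0 - lam) * math.log(distance)


# Toy check (assumed data): a wider uniform source has higher entropy.
rng = random.Random(0)
narrow = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(400)]  # true H = 0
wide = [(rng.uniform(0, 2), rng.uniform(0, 2)) for _ in range(400)]    # true H = log 4
h_narrow, h_wide = kl_entropy(narrow), kl_entropy(wide)
```

At equal data misfit, `regularized_objective` prefers the source with the larger entropy estimate, which is the behaviour the maximum-entropy principle asks for. The estimator only needs samples, which is why it fits a purely sample-based method.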

Abstract

Scientific modeling applications often require estimating a distribution of parameters consistent with a dataset of observations - an inference task also known as source distribution estimation. This problem can be ill-posed, however, since many different source distributions might produce the same distribution of data-consistent simulations. To make a principled choice among many equally valid sources, we propose an approach which targets the maximum entropy distribution, i.e., prioritizes retaining as much uncertainty as possible. Our method is purely sample-based - leveraging the Sliced-Wasserstein distance to measure the discrepancy between the dataset and simulations - and thus suitable for simulators with intractable likelihoods. We benchmark our method on several tasks, and show that it can recover source distributions with substantially higher entropy than recent source estimation methods, without sacrificing the fidelity of the simulations. Finally, to demonstrate the utility of our approach, we infer source distributions for parameters of the Hodgkin-Huxley model from experimental datasets with thousands of single-neuron measurements. In summary, we propose a principled method for inferring source distributions of scientific simulator parameters while retaining as much uncertainty as possible.
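
To make the sample-based discrepancy concrete, here is a minimal sketch of a Monte Carlo Sliced-Wasserstein distance between two sample sets, using only the Python standard library. It is a sketch under stated assumptions rather than the paper's implementation: the function name `sliced_wasserstein`, the number of projections, the exponent `p=2`, and the requirement of equally sized sample sets are illustrative choices.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, p=2, rng=None):
    """Monte Carlo sliced p-Wasserstein distance between two equal-size sample sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples in xs and ys")
    rng = rng or random.Random(0)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a direction uniformly on the unit sphere via a normalized Gaussian.
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both sample sets to 1D; sorting gives the optimal 1D coupling.
        px = sorted(sum(a * b for a, b in zip(x, direction)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, direction)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)


# Toy check (assumed data): a shifted copy of the samples is farther away than the samples themselves.
rng = random.Random(1)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
shifted = [(a + 3.0, b) for a, b in xs]
d_same = sliced_wasserstein(xs, xs)
d_shift = sliced_wasserstein(xs, shifted)
```

Because each projection only requires sorting one-dimensional values, the distance needs nothing but samples from the simulator and the dataset, with no likelihood evaluations.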

Paper Structure (47 sections, 2 theorems, 15 equations, 16 figures, 3 tables, 1 algorithm)

This paper contains 47 sections, 2 theorems, 15 equations, 16 figures, 3 tables, 1 algorithm.

Key Result

Proposition 2.1

Let $Q = \{q|q^\# = p_o\}$ be the set of source distributions for a given likelihood $p(x|\theta)$ and data distribution $p_o$. Suppose that $Q$ is non-empty and compact. Then $q^* = \mathop{\mathrm{arg\,max}}\limits_{q \in Q}H(q)$ exists and is unique.
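
The uniqueness claim can be motivated by a standard convexity argument, sketched below. This is not necessarily the proof given in the paper; it covers uniqueness only, and it assumes that the entropies involved are finite and that $q_1$ and $q_2$ differ on a set of positive measure.

```latex
Suppose $q_1 \neq q_2$ both attain $H^* = \max_{q \in Q} H(q)$, and let
$q_m = \tfrac{1}{2}(q_1 + q_2)$.

% Step 1: the pushforward is linear in q, so Q is convex.
\begin{equation*}
q_m^{\#}(x) = \int p(x|\theta)\, q_m(\theta)\, d\theta
  = \tfrac{1}{2} q_1^{\#}(x) + \tfrac{1}{2} q_2^{\#}(x) = p_o(x)
  \quad\Longrightarrow\quad q_m \in Q.
\end{equation*}

% Step 2: differential entropy is strictly concave.
\begin{equation*}
H(q_m) > \tfrac{1}{2} H(q_1) + \tfrac{1}{2} H(q_2) = H^*.
\end{equation*}

This contradicts the maximality of $H^*$, so the maximizer is unique.
```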

Figures (16)

  • Figure 1: Maximum entropy source distribution estimation. Given an observed dataset $\mathcal{D} = \{x_{1},\ldots,x_{n}\}$ from some data distribution $p_{o}(x)$, the source distribution estimation problem is to find the parameter distribution $q(\theta)$ that reproduces $p_{o}(x)$ when passed through the simulator $p(x|\theta)$, i.e. $q^{\#}(x) = \int p(x|\theta)q(\theta)d\theta = p_{o}(x)$ for all $x$. This problem can be ill-posed, as there might be more than one distinct source distribution. We resolve this by targeting the maximum entropy distribution, which is unique.
  • Figure 2: Overview of Sourcerer. Given a source distribution $q(\theta)$, we sample $\theta\sim q$ and simulate using $p(x|\theta)$ to obtain samples from the pushforward distribution $q^{\#}(x) = \int p(x|\theta)q(\theta)d\theta$. We maximize the entropy of the source distribution $q(\theta)$ while regularizing with a Sliced-Wasserstein distance (SWD) term between the pushforward $q^{\#}$ and the data distribution $p_{o}(x)$ (Eq. \ref{eq: unconstrained loss}). $\Theta$ and $\mathcal{X}$ in the top-right corner of each box denote the parameter space and the data/observation space, respectively.
  • Figure 3: Results for the source estimation benchmark. (a) Original and estimated source and corresponding pushforward for the differentiable IK simulator ($\lambda=0.35$). The estimated source has higher entropy than the original source that was used to generate the data. The observations (simulated with parameters from the original source) and simulations (simulated with parameters from the estimated source) match. (b) Performance of our approach for all four benchmark tasks (TM, IK, SLCP, GM) using both the original (differentiable) simulators, and learned surrogates. Source estimation is performed without (NA) and with entropy regularization for different choices of $\lambda$. For all cases, mean C2ST accuracy between observations and simulations (lower is better) as well as the mean entropy of estimated sources (higher is better) over five runs are shown together with the standard deviation. The gray line at $\lambda=0.35$ ($\lambda=0.062$ for GM) indicates our choice of final $\lambda$ for the numerical benchmark results (Table \ref{tab:benchmark_numbers}).
  • Figure 4: Source estimation on differentiable simulators. For both the deterministic SIR model (a) and probabilistic Lotka-Volterra model (b), the Sliced-Wasserstein distance (lower is better) between observations and simulations as well as entropy of estimated sources (higher is better) for different choices of $\lambda$ and without the entropy regularization (NA) are shown. Mean and standard deviation are computed over five runs.
  • Figure 5: Source estimation for the single-compartment Hodgkin-Huxley model. (a) Example voltage traces of the real observations of the motor cortex dataset, simulations from the estimated source ($\lambda=0.25$), and samples from the uniform distribution used to train the surrogate. (b) 1D and 2D marginals for three of the five summary statistics used to perform source estimation. (c) 1D and 2D marginal distributions of the estimated source for three of the 13 simulator parameters. (d) and (e) C2ST accuracy and Sliced-Wasserstein distance (lower is better) as well as entropy of estimated sources (higher is better) for different choices of $\lambda$ including $\lambda=0.25$ (gray line) and without entropy regularization (NA). Mean and standard deviation over five runs are shown.
  • ...and 11 more figures
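
The ill-posedness described in the Figure 1 caption can be illustrated with a toy example. The sketch below is purely hypothetical and not taken from the paper: `toy_simulator` (a deterministic map $x = |\theta|$), the two uniform sources, and the helper `sample_pushforward` are assumptions chosen so that two different sources yield the same pushforward while differing in entropy.

```python
import math
import random


def toy_simulator(theta, rng):
    """Hypothetical deterministic simulator x = |theta|; rng is unused here."""
    return abs(theta)


def sample_pushforward(source_sampler, simulator, n, rng):
    """Draw theta ~ q, then x ~ p(x | theta), giving n samples from q^#."""
    return [simulator(source_sampler(rng), rng) for _ in range(n)]


def narrow_source(rng):
    return rng.uniform(0.0, 1.0)   # Uniform[0, 1]


def wide_source(rng):
    return rng.uniform(-1.0, 1.0)  # Uniform[-1, 1]


rng = random.Random(0)
x_narrow = sorted(sample_pushforward(narrow_source, toy_simulator, 5000, rng))
x_wide = sorted(sample_pushforward(wide_source, toy_simulator, 5000, rng))

# Empirical 1D Wasserstein-1 distance between the two pushforwards (sorted coupling).
w1 = sum(abs(a - b) for a, b in zip(x_narrow, x_wide)) / len(x_narrow)

# Differential entropy of Uniform[a, b] is log(b - a).
entropy_narrow = math.log(1.0 - 0.0)
entropy_wide = math.log(1.0 - (-1.0))
```

Both sources push forward to Uniform[0, 1], so the data alone cannot distinguish them; the maximum-entropy criterion would prefer the wider source, whose entropy is $\log 2$ rather than $0$.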

Theorems & Definitions (2)

  • Proposition 2.1
  • Proposition A.1