Einstein from Noise: Statistical Analysis

Amnon Balanov; Wasim Huleihel; Tamir Bendory

Abstract

"Einstein from noise" (EfN) is a prominent example of the model bias phenomenon: systematic errors in the statistical model that lead to spurious but consistent estimates. In the EfN experiment, one falsely believes that a set of observations contains noisy, shifted copies of a template signal (e.g., an Einstein image), whereas in reality, it contains only pure noise observations. To estimate the signal, the observations are first aligned with the template using cross-correlation, and then averaged. Although the observations contain nothing but noise, it was recognized early on that this process produces a signal that resembles the template signal! This pitfall was at the heart of a central scientific controversy about validation techniques in structural biology. This paper provides a comprehensive statistical analysis of the EfN phenomenon. We show that the Fourier phases of the EfN estimator (namely, the average of the aligned noise observations) converge to the Fourier phases of the template signal, explaining the observed structural similarity. Additionally, we prove that the convergence rate is inversely proportional to the number of noise observations and, in the high-dimensional regime, to the Fourier magnitudes of the template signal. Moreover, in the high-dimensional regime, the Fourier magnitudes converge to a scaled version of the template signal's Fourier magnitudes. This work not only deepens the theoretical understanding of the EfN phenomenon but also highlights potential pitfalls in template matching techniques and emphasizes the need for careful interpretation of noisy observations across disciplines in engineering, statistics, physics, and biology.
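The align-then-average procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal one-dimensional illustration, not the authors' code: the function names, the exponentially decaying test template, the sizes `d = 64` and `num_obs = 5000`, and the use of white Gaussian noise are all assumptions made for the example.

```python
import numpy as np


def exponential_template(d, alpha):
    """Hypothetical test template: decaying exponential, mean removed, unit norm."""
    x = np.exp(-np.arange(d) / alpha)
    x -= x.mean()
    return x / np.linalg.norm(x)


def efn_estimator(template, num_obs, rng):
    """Align pure-noise rows to `template` by circular cross-correlation, then average."""
    d = template.shape[0]
    noise = rng.standard_normal((num_obs, d))  # observations contain no signal at all
    # xcorr[i, r] = sum_m noise[i, m] * template[(m - r) % d], computed via the FFT
    noise_f = np.fft.fft(noise, axis=1)
    template_f = np.fft.fft(template)
    xcorr = np.fft.ifft(noise_f * np.conj(template_f), axis=1).real
    shifts = np.argmax(xcorr, axis=1)  # best-matching cyclic shift per observation
    # Cyclically shift row i by -shifts[i]: aligned[i, m] = noise[i, (m + shifts[i]) % d]
    idx = (np.arange(d)[None, :] + shifts[:, None]) % d
    aligned = np.take_along_axis(noise, idx, axis=1)
    return aligned.mean(axis=0)  # average of the aligned noise observations


rng = np.random.default_rng(0)
x = exponential_template(64, alpha=2.0)
x_hat = efn_estimator(x, num_obs=5000, rng=rng)
print("Pearson correlation with template:", np.corrcoef(x, x_hat)[0, 1])
```

Because every row is shifted to the position where it best matches the template, each aligned row has a non-negative expected inner product with the template, so the average inherits template-like structure even though no signal is present.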

Paper Structure (79 sections, 29 theorems, 360 equations, 6 figures)

Introduction
Main results.
Organization.
Problem Formulation and Notation
Notations.
Problem formulation.
Fourier space notation.
Assumptions.
Cryo-EM and Empirical Demonstration
The EfN controversy.
Empirical demonstration.
More applications.
Previous work.
Main Results
Finite-dimensional signal
...and 64 more sections

Key Result

Theorem 4.1

Fix $d\geq 2$ and assume that $\mathsf{X}[k] \neq 0$, for all $0< k \leq d-1$.

Figures (6)

Figure 1: Einstein from Noise. The EfN estimator consists of three stages: (1) finding the index of the maximum of the cross-correlation ($\hat{\mathsf{R}}_i$) between the $i$-th noise signal ($n_i$) and the template signal (e.g., Einstein's image); (2) cyclically shifting the noise signal by $-\hat{\mathsf{R}}_i$; (3) averaging the shifted noise signals. In this paper, we characterize the relationship between the output of this process---the EfN estimator---and the template signal.
Figure 2: The impact of the number of noise observations on the EfN estimator. The EfN estimator is defined in real space by \ref{eqn:efnEstimatorRealSpace} and in Fourier space by \ref{eqn:estimatorFourierRepresentation_pre}. (a) The structural similarity between the EfN estimator and the template image increases as a function of the number of noise observations ($M$). (b) The mean-square-error (MSE) between the Fourier phases of the template image $\mathsf{X}[k_1, k_2]$ and the EfN estimator $\hat{\mathsf{X}}[k_1, k_2]$ for $-100 \leq k_1, k_2 \leq 100$, where $k_1, k_2$ are the indices of the 2D DFT. The colors in the left panel in (b) represent the power spectral density (PSD) of the Einstein image, while the colors in the four right panels represent the MSE between the Fourier phases of the Einstein image and the EfN estimator, for each spectral component, with a varying number of observations ($M = 200, 500, 1500, 5000$). An increase in the number of observations leads to a lower MSE of the Fourier phases between the EfN estimator and the template signal. A similar trend can be seen with respect to the strength of the spectral components, i.e., stronger spectral components lead to a lower Fourier-phase MSE. (c) The convergence rate of the MSE between the Fourier phases of the EfN estimator and the Fourier phases of the template signal as a function of the number of observations across different frequencies. The MSE decays as $1/M$. In addition, stronger spectral components lead to lower MSE. Panels (b) and (c) were generated through 200 Monte-Carlo trials of the EfN process defined in \ref{eqn:efnEstimatorRealSpace}.
Figure 3: The influence of the power-spectral-density (PSD) of the template signal on the correlation between the template and the EfN estimator. (a) Three images of the letter A are shown, with an increasing zero-padding ratio. As the zero-padding ratio increases, the PSD flattens, and the cross-correlation (CC) between the template and the EfN estimator increases. This higher cross-correlation is evident in both the image background and the colors of the letter A. (b) Flatter PSDs lead to EfN estimators whose Fourier magnitudes are closer to those of the template image. The EfN estimators in these experiments were generated using $M=10^5$ observations.
Figure 4: Comparison between the analytic expressions and Monte-Carlo simulations for signals of high dimension $d$ and with varying power spectral densities. The analytic predictions for Fourier-phase convergence and Fourier-magnitude scaling are given by \ref{eqn:phaseConvergeneRateForAsymptoticD} and \ref{eqn:magnitudeConvergenceAsymptoticMandD}, respectively. (a) Template PSDs for three template families at a representative dimension $d=8192$. For each dimension $d$, the template $x^{(d)}\in\mathbb{R}^d$ is generated directly at length $d$ as an exponentially decaying signal, $x^{(d)}_\ell[m]\propto \exp(-m/\alpha_\ell)$, $m=0,1,\ldots,d-1$, with decay parameters $\alpha_\ell\in\{0.02,\,2,\,10\}$ (Signals 1-3, respectively), followed by mean removal and normalization. (b) Monte-Carlo estimates of the Pearson cross-correlation $\mathrm{PCC}(x^{(d)}_\ell,\hat{x}^{(d)}_\ell)$ between the template $x^{(d)}_\ell$ and the EfN estimate $\hat{x}^{(d)}_\ell$ as a function of the signal length $d$ (with fixed sample size $M=10^4$). As $d$ increases, the correlation increases, particularly for templates with slower-decaying PSDs. (c) Per-frequency phase mean-squared error $\mathbb{E}|\phi_{\hat{\mathsf{X}}^{(d)}_\ell}[k]-\phi_{\mathsf{X}^{(d)}_\ell}[k]|^2$ at $d=8192$: Monte-Carlo estimates (blue) are compared with the asymptotic expression (red), i.e., the large-$(M,d)$ closed-form approximation predicted by \ref{eqn:phaseConvergeneRateForAsymptoticD}. All Monte-Carlo curves are averaged over $2000$ independent trials.
Figure 5: The impact of noise statistics and signal dimension ($d$) on Fourier phase convergence. Each panel displays the mean squared error (MSE) between the Fourier phases of the true template and those estimated by EfN, shown for three representative Fourier components. The dashed line represents the theoretical $1/M$ convergence rate. Columns correspond to different noise distributions: white Gaussian noise, i.i.d. noise drawn from a uniform distribution over the interval $[0,1]$, and i.i.d. Poisson noise with parameter $\lambda = 10$. Rows correspond to increasing signal dimensions: $d = 8$, $32$, and $1024$. For white Gaussian noise, the Fourier phases converge at the expected $1/M$ rate across all signal dimensions, in agreement with Theorem \ref{thm:1}. In contrast, under uniform and Poisson noise, the MSE plateaus at low dimensions. However, increasing the signal dimension restores convergence, even under non-Gaussian noise, consistent with the high-dimensional regime described in Theorem \ref{thm:highDimentionalNoiseExtention}. Notably, for $d = 1024$, all three noise models produce similar MSE values across the selected Fourier components, suggesting that their phase noise statistics become nearly indistinguishable. Each data point represents an average of 300 Monte Carlo trials.
...and 1 more figure
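The $1/M$ decay of the Fourier-phase MSE reported in the captions above can be checked numerically with a small Monte-Carlo experiment. The sketch below is illustrative only and does not reproduce the paper's figures: the function names, the template ($d = 16$, exponential decay with parameter 2), the chosen frequency $k = 1$, the trial count, and the values of $M$ are all assumptions, and the noise is white Gaussian.

```python
import numpy as np


def efn_fourier(template, num_obs, rng):
    """DFT of the EfN estimator built from `num_obs` white Gaussian noise rows."""
    d = template.shape[0]
    noise_f = np.fft.fft(rng.standard_normal((num_obs, d)), axis=1)
    template_f = np.fft.fft(template)
    # Circular cross-correlation of each noise row with the template
    xcorr = np.fft.ifft(noise_f * np.conj(template_f), axis=1).real
    shifts = np.argmax(xcorr, axis=1)
    # A cyclic shift by -r multiplies the k-th DFT coefficient by exp(+2*pi*i*k*r/d)
    ramp = np.exp(2j * np.pi * np.outer(shifts, np.arange(d)) / d)
    return (noise_f * ramp).mean(axis=0)


def phase_mse(template, num_obs, k, trials, rng):
    """Monte-Carlo estimate of the squared phase error at frequency index k."""
    target = np.fft.fft(template)[k]
    errors = np.empty(trials)
    for t in range(trials):
        estimate = efn_fourier(template, num_obs, rng)[k]
        # Wrapped phase difference in (-pi, pi]
        errors[t] = np.angle(estimate * np.conj(target))
    return np.mean(errors ** 2)


rng = np.random.default_rng(0)
d = 16
x = np.exp(-np.arange(d) / 2.0)  # hypothetical template
x -= x.mean()
x /= np.linalg.norm(x)

for num_obs in (200, 800, 3200):
    mse = phase_mse(x, num_obs, k=1, trials=50, rng=rng)
    print(f"M={num_obs:5d}  phase MSE={mse:.3e}  M*MSE={mse * num_obs:.3f}")
```

If the phase MSE scales as $1/M$, the printed product `M*MSE` should stay roughly constant across rows, up to Monte-Carlo fluctuation.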

Theorems & Definitions (56)

Theorem 4.1: Fourier phases convergence for finite-dimensional signal
Theorem 4.3: Fourier phases convergence for high-dimensional signal
Proposition 5.1: Positive correlation
Theorem 5.2: High-dimensional i.i.d. noise
Definition 5.3: Symmetric circulant matrix
Proposition 5.4: Fourier phase convergence under circulant Gaussian noise
Lemma A.1
Remark A.2
Proof of Lemma \ref{lemma:conditioning}
Lemma A.3: Uniqueness of the maximizer
...and 46 more
