CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM

Minkyu Jeon; Rishwanth Raghu; Miro Astore; Geoffrey Woollard; Ryan Feathers; Alkin Kaz; Sonya M. Hanson; Pilar Cossio; Ellen D. Zhong

TL;DR

CryoBench addresses the lack of standardized benchmarks for heterogeneous cryo-EM reconstruction by introducing five synthetic datasets that span conformational and compositional heterogeneity, along with a forward imaging model and ground-truth coordinates. It provides a comprehensive evaluation framework with embedding-based metrics (Neighborhood Similarity and Information Imbalance) and FSC-based volume metrics (FSC_AUC and Per-Image FSC), and benchmarks a suite of ten methods across these datasets. The findings reveal strengths and gaps among current methods, showing that some fixed-pose approaches perform well on simpler continua while complex, MD-derived or large-state mixtures remain challenging for ab initio reconstruction. By releasing datasets, metrics, and tooling, CryoBench aims to accelerate method development, enable rigorous comparisons, and stimulate biophysically informed improvements in cryo-EM heterogeneity analysis.
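To make the FSC-based volume metrics concrete, here is a minimal NumPy sketch of a Fourier shell correlation (FSC) curve between a reconstructed and a ground-truth volume, with the area under that curve as a single summary score. The function names `fsc_curve` and `fsc_auc`, the integer shell binning, and the rescaling of the frequency axis to [0, 1] are assumptions made for illustration; the CryoBench tooling may mask, bin, or normalize differently.

```python
import numpy as np


def fsc_curve(vol_a, vol_b):
    """Fourier shell correlation between two cubic volumes of equal size.

    Hypothetical helper: returns one correlation value per integer-radius
    shell, from the zero frequency out to the Nyquist radius n // 2.
    """
    n = vol_a.shape[0]
    fa = np.fft.fftshift(np.fft.fftn(vol_a))
    fb = np.fft.fftshift(np.fft.fftn(vol_b))

    # Integer radius of every voxel measured from the centre of Fourier space.
    grid = np.arange(n) - n // 2
    kx, ky, kz = np.meshgrid(grid, grid, grid, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int).ravel()

    # Discard the corners of the cube that lie beyond the Nyquist radius.
    n_shells = n // 2 + 1
    keep = shell < n_shells
    shell = shell[keep]

    cross = (fa * np.conj(fb)).real.ravel()[keep]
    power_a = (np.abs(fa) ** 2).ravel()[keep]
    power_b = (np.abs(fb) ** 2).ravel()[keep]

    # Sum each quantity over shells, then normalize the cross term.
    num = np.bincount(shell, weights=cross, minlength=n_shells)
    den = np.sqrt(
        np.bincount(shell, weights=power_a, minlength=n_shells)
        * np.bincount(shell, weights=power_b, minlength=n_shells)
    )
    return num / np.maximum(den, 1e-12)


def fsc_auc(fsc):
    """Trapezoidal area under an FSC curve, frequency axis rescaled to [0, 1]."""
    freqs = np.linspace(0.0, 1.0, len(fsc))
    return float(np.sum((fsc[1:] + fsc[:-1]) / 2.0 * np.diff(freqs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.standard_normal((16, 16, 16))
    noisy = truth + 0.5 * rng.standard_normal((16, 16, 16))
    print("AUC, identical volumes:", fsc_auc(fsc_curve(truth, truth)))
    print("AUC, noisy copy:", fsc_auc(fsc_curve(truth, noisy)))
```

Under these assumptions, a Per-Image FSC would repeat the same comparison for the volume a method generates for each image against that image's ground-truth volume, then average the resulting curves or scores.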

Abstract

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. Its unique ability to capture structural variability has spurred the development of heterogeneous reconstruction algorithms that can infer distributions of 3D structures from noisy, unlabeled imaging data. Despite the growing number of advanced methods, progress in the field is hindered by the lack of standardized benchmarks with ground truth information and reliable validation metrics. Here, we introduce CryoBench, a suite of datasets, metrics, and benchmarks for heterogeneous reconstruction in cryo-EM. CryoBench includes five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from designed motions of antibody complexes or sampled from a molecular dynamics simulation, as well as compositional heterogeneity from mixtures of ribosome assembly states or 100 common complexes present in cells. We then analyze state-of-the-art heterogeneous reconstruction tools, including neural and non-neural methods, assess their sensitivity to noise, and propose new metrics for quantitative evaluation. We hope that CryoBench will be a foundational resource for accelerating algorithmic development and evaluation in the cryo-EM and machine learning communities. Project page: https://cryobench.cs.princeton.edu.

Paper Structure (51 sections, 4 equations, 38 figures, 9 tables)

Introduction
Background and Related Work
CryoBench Design
Image Formation Model
Conformational Heterogeneity
Compositional Heterogeneity
Evaluation Framework
Analysis and Metrics
Embedding Comparisons
Volume Metrics
Results
Conclusion and Future Directions
Data and Software Availability
Data Availability
Software Availability
...and 36 more sections

Figures (38)

Figure 1: Overview of CryoBench. a) Image formation model. In cryo-EM, each image $X_i$ captures a molecule $V_i$ projected at an unknown pose $\phi_i$. A latent variable $z_i$ models the conformational space $\mathcal{V}$ that describes the heterogeneity among the set of molecules $\{V_i\}$. b) Datasets. CryoBench includes 5 synthetic datasets of varying difficulty, characterized by heterogeneity arising from conformational (i.e. shape) or compositional (i.e. identity) changes. c) Methods. Methods can be grouped by whether they model heterogeneity with a continuous latent variable $z$ or a discrete latent variable $\pi$. Hidden variables assumed to be known are shown in gray. Volumes are represented as a neural field (NF), voxel array (VA), neural volume (NV), or tetrahedral mesh (TM). Generative models are colored blue for nonlinear neural methods, orange for linear generative models, pink for mixture models, and green for density-preserving motion models. d) Metrics. Summary of metrics used to assess both latent inference and volume reconstruction quality.
Figure 2: IgG-1D results. a) Dataset design. Conformational heterogeneity of an IgG antibody complex produced from a simple, 1D continuous circular motion. b) Representative reconstructed and ground truth (G.T.) volumes. c) Latent embeddings visualized by UMAP and colored by the G.T. dihedral angle parameterizing the circular motion. Discrete class assignments are plotted by G.T. dihedral angles. d) Latent embedding analysis by neighborhood similarity and information imbalance. e) Per-Image FSC curves. Each curve shows the average FSC curve across all conformations with error bars indicating the standard deviation. Colors in b), d), and e) correspond to methods shown in the legend. Additional results shown in Figure \ref{fig:si_kmeans_igg1d}.
Figure 3: IgG-1D with noise. a) Per-Image FSC for each method at different noise levels. Markers correspond to the legend in Figure \ref{fig:fig2_conf1}. b) Example cryo-EM images for different noise levels and latent embeddings visualized by UMAP for CryoDRGN-AI. Additional results shown in Figure \ref{fig:si_noise}, \ref{fig:si_kmeans_igg1d_snr0005}, and \ref{fig:si_kmeans_igg1d_snr0001}.
Figure 4: IgG-RL results. a) Dataset design. Conformational heterogeneity is produced by sampling 100 configurations of a peptide linker, randomly orienting the FAb domain in the IgG antibody complex. b) Representative reconstructed and ground truth (G.T.) volumes. c) The UMAP plots of RECOVAR and OPUS-DSD latent spaces colored by the distance between the FAb and the Fc domain in the G.T. volumes. d) Latent embedding analysis by neighborhood similarity and information imbalance. e) Per-Image FSC curves. Each curve shows the average FSC curve across all conformations with error bars indicating the standard deviation. Colors in (d), (e), and (f) correspond to methods shown in the legend. Additional results shown in Figure \ref{fig:si_igg_rl_latents_cv} and \ref{fig:si_kmeans_iggrl}.
Figure 5: Spike-MD results. a) Dataset design. 46,789 structures were sampled from an MD simulation of the SARS-CoV-2 spike protein, including opening of the receptor binding domain (RBD, shown in red). The motion in the MD simulation can be described with two collective variables (CV), corresponding to opening and twisting of the RBD. b) The population density of molecular states projected onto these CVs. c) Representative reconstructed and ground truth (G.T.) volumes. d) Latent embeddings visualized by UMAP and colored by the first and second CV. Additional results shown in Figure \ref{fig:si_per_conf_md}, \ref{fig:si_md_vols}, and \ref{fig:si_MD_embd_metrics}.
...and 33 more figures
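Several captions above report a latent embedding analysis by neighborhood similarity. As a rough illustration of the idea, the sketch below scores how many of each image's k nearest neighbors in a learned latent space are also among its k nearest neighbors in the ground-truth coordinates. The names `knn_indices` and `neighborhood_similarity`, the Euclidean distance, and the normalization by k are assumptions for illustration and may not match the definition used in the paper.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbors of every row, excluding the row itself."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def neighborhood_similarity(latents, truth, k):
    """Mean fraction of k-nearest neighbors shared between two embeddings.

    latents: (n, d_latent) array of inferred latent coordinates.
    truth:   (n, d_truth) array of ground-truth coordinates.
    Returns a value in [0, 1]; 1 means every neighborhood is preserved.
    """
    nn_latent = knn_indices(latents, k)
    nn_truth = knn_indices(truth, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_latent, nn_truth)]
    return float(np.mean(shared)) / k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Ground truth on a circle, loosely mimicking a 1D circular motion.
    angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
    truth = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    good = 3.0 * truth + 0.01 * rng.standard_normal(truth.shape)
    bad = rng.standard_normal(truth.shape)
    print("structured latent:", neighborhood_similarity(good, truth, k=10))
    print("random latent:", neighborhood_similarity(bad, truth, k=10))
```

In this sketch, a periodic ground-truth coordinate such as a dihedral angle is embedded as (cos, sin) before being passed in, so that angles near 0 and 2π count as neighbors. The pairwise distance matrix is O(n²) in memory, so a tree-based neighbor search would be needed for datasets of realistic size.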
