CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy

Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu

TL;DR

CryoFastAR introduces a feed-forward geometric foundation model for fast ab initio cryo-EM reconstruction by directly predicting relative poses from unordered, noisy particle images. It employs a Vision Transformer–based encoder to extract multi-view features and a cross-attentive decoder to produce Fourier planar maps that encode poses relative to a reference view, enabling efficient Fourier-domain back-projection for reconstruction. The model is trained with a progressive curriculum on a large-scale simulated dataset and fine-tuned on real cryo-EM data, achieving competitive reconstruction quality while delivering substantial speedups over traditional iterative pipelines. This work demonstrates the viability of end-to-end pose prediction in cryo-EM and highlights the potential of geometric foundation models to accelerate and stabilize high-resolution structure determination under challenging imaging conditions.
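
One way to read the "Fourier planar map" step: if every pixel of a view is assigned a predicted 3D coordinate in the reference view's frame, the view's rotation can be recovered by fitting a rigid rotation between the canonical in-plane frequency grid and those predicted coordinates. The sketch below does this with the standard Kabsch (orthogonal Procrustes) fit in NumPy. The function names, the grid layout, and the choice of Kabsch are assumptions made for illustration; the summary above does not state how CryoFastAR converts planar maps into pose parameters.

```python
import numpy as np

def fourier_grid(size):
    """Centered (u, v, 0) frequency coordinates of a size x size image, shape (size*size, 3)."""
    freqs = np.fft.fftshift(np.fft.fftfreq(size))
    u, v = np.meshgrid(freqs, freqs, indexing="xy")
    return np.stack([u.ravel(), v.ravel(), np.zeros(size * size)], axis=1)

def rotation_from_planar_map(planar_map, grid):
    """Least-squares rotation R such that R @ grid[i] ~= planar_map[i].

    planar_map: (N, 3) predicted 3D coordinates, one per pixel (hypothetical model output).
    grid:       (N, 3) canonical in-plane coordinates with z = 0.
    No centering is applied because every Fourier slice passes through the origin.
    """
    h = grid.T @ planar_map                      # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))       # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

# Usage: recover a known rotation from a noisy synthetic planar map.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
true_rotation = q * np.sign(np.linalg.det(q))    # force det = +1
grid = fourier_grid(32)
noisy_map = grid @ true_rotation.T + 0.01 * rng.normal(size=grid.shape)
estimated_rotation = rotation_from_planar_map(noisy_map, grid)
```

Because the grid points are coplanar, the cross-covariance has rank 2; the determinant correction is what fixes the third axis and keeps the result a proper rotation.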

Abstract

Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from noisy cryo-EM images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR improves pose estimation accuracy and generalization. To stabilize training, we propose a progressive training strategy that first lets the model extract essential features under simpler conditions and then gradually increases difficulty to improve robustness. Experiments on both synthetic and real datasets show that CryoFastAR achieves reconstruction quality comparable to traditional iterative approaches while significantly accelerating inference.
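
To make "simulated data with realistic noise and CTF modulations" and "gradually increasing difficulty" concrete, the following NumPy sketch modulates a clean projection by a simplified CTF, adds Gaussian noise at a target SNR, and lowers that SNR over training steps. Everything specific here is an assumption: the CTF omits astigmatism and envelope terms, its sign convention and the default microscope parameters are illustrative, and the log-linear `snr_schedule` is a hypothetical curriculum, not the schedule used by the authors.

```python
import numpy as np

def ctf_2d(size, pixel_size, defocus, voltage_kv=300.0, cs_mm=2.7, amp_contrast=0.1):
    """Simplified CTF on a size x size grid in FFT layout.

    pixel_size and defocus are in Angstrom; defaults are illustrative values.
    """
    volts = voltage_kv * 1e3
    # Relativistic electron wavelength in Angstrom.
    wavelength = 12.2643 / np.sqrt(volts + 0.97845e-6 * volts**2)
    freqs = np.fft.fftfreq(size, d=pixel_size)           # cycles per Angstrom
    kx, ky = np.meshgrid(freqs, freqs, indexing="xy")
    k2 = kx**2 + ky**2
    cs = cs_mm * 1e7                                     # mm -> Angstrom
    chi = np.pi * wavelength * defocus * k2 - 0.5 * np.pi * cs * wavelength**3 * k2**2
    return -(np.sqrt(1.0 - amp_contrast**2) * np.sin(chi) + amp_contrast * np.cos(chi))

def simulate_particle(clean, ctf, snr, rng):
    """Apply the CTF in Fourier space, then add white Gaussian noise at the given SNR."""
    modulated = np.fft.ifft2(np.fft.fft2(clean) * ctf).real
    noise_std = np.sqrt(modulated.var() / snr)           # SNR = var(signal) / var(noise)
    return modulated + rng.normal(scale=noise_std, size=clean.shape)

def snr_schedule(step, total_steps, snr_start=1.0, snr_end=0.01):
    """Hypothetical curriculum: SNR decays log-linearly from easy to hard."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return float(snr_start * (snr_end / snr_start) ** t)

# Usage: one training sample halfway through the curriculum.
rng = np.random.default_rng(0)
clean_projection = rng.normal(size=(64, 64))             # stand-in for a simulated projection
ctf = ctf_2d(64, pixel_size=1.0, defocus=15000.0)
sample = simulate_particle(clean_projection, ctf, snr_schedule(50, 100), rng)
```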

Paper Structure

This paper contains 54 sections, 18 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: CryoFastAR enables fast feed-forward ab initio reconstruction from hundreds of thousands of unordered, unposed, and highly noisy cryo-EM particle images. Compared to existing baselines, it achieves significantly higher reconstruction speed. We define reconstruction FPS as the average number of particle images processed per second during ab initio reconstruction.
  • Figure 2: Pipeline of CryoFastAR. Our method takes multiple noisy cryo-EM particle images as input and extracts patch-level features using a shared Vision Transformer (ViT) encoder, which incorporates 2D Rotary Position Embeddings (RoPE) and view embeddings. These features are then integrated through stacked View Integration and Refinement blocks. The model outputs Fourier planar maps via two prediction heads, encoding the pose of each view relative to a reference view. Finally, these planar maps are converted to explicit pose parameters, enabling efficient 3D reconstruction via direct back-projection in Fourier space.
  • Figure 3: Qualitative results. We compare the visual quality of our reconstructions with all baselines, before and after refinement, on FA and Spike. Our method is comparable to the baselines before refinement and achieves the best performance after refinement.
  • Figure 4: Qualitative comparison on the experimental spliceosome dataset. Our method achieves the best visual quality and reconstruction resolution among all baselines, while CryoSPARC fails to converge to the correct structure due to the heterogeneity of the spliceosome.
  • Figure 5: Evaluation over number of views and SNR. Our model performs robustly across different SNRs and achieves better results as the number of input views increases.
  • ...and 1 more figure
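
The "direct back-projection in Fourier space" mentioned in Figure 2 and the reconstruction FPS metric defined in Figure 1 can be sketched as follows. By the Fourier slice theorem, the 2D Fourier transform of a projection is a central slice of the volume's 3D Fourier transform, so each posed image can be inserted into a 3D Fourier grid and the grid inverted once at the end. This is a minimal NumPy sketch under stated assumptions: nearest-neighbor insertion, no CTF correction or interpolation weighting, rotations that map in-plane (x, y, 0) coordinates into the volume frame, and hypothetical function names. It is not the authors' implementation.

```python
import numpy as np

def backproject(images, rotations):
    """Insert each image's 2D spectrum as a central slice of a 3D Fourier volume.

    images:    (M, n, n) real particle images.
    rotations: (M, 3, 3) matrices mapping plane coordinates (x, y, 0) to volume coordinates.
    Returns an (n, n, n) real-space volume indexed as [z, y, x].
    """
    n = images.shape[-1]
    coords = np.arange(n) - n // 2                        # centered integer frequencies
    u, v = np.meshgrid(coords, coords, indexing="xy")
    plane = np.stack([u.ravel(), v.ravel(), np.zeros(n * n)], axis=0)   # (3, n*n)
    vol_f = np.zeros((n, n, n), dtype=complex)
    weights = np.zeros((n, n, n))
    for image, rot in zip(images, rotations):
        slice_f = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image))).ravel()
        idx = np.rint(rot @ plane).astype(int) + n // 2   # nearest voxel per sample
        keep = np.all((idx >= 0) & (idx < n), axis=0)     # drop samples outside the grid
        x, y, z = idx[:, keep]
        np.add.at(vol_f, (z, y, x), slice_f[keep])        # accumulate, duplicates included
        np.add.at(weights, (z, y, x), 1.0)
    vol_f /= np.maximum(weights, 1.0)                     # average voxels hit more than once
    return np.fft.fftshift(np.fft.ifftn(np.fft.ifftshift(vol_f))).real

def reconstruction_fps(num_images, elapsed_seconds):
    """Average number of particle images processed per second."""
    return num_images / elapsed_seconds

# Usage: a single image with the identity pose yields a volume constant along z.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
volume = backproject(image[None], np.eye(3)[None])
```

With the identity pose, only the kz = 0 plane of the Fourier volume is filled, so summing the volume along z reproduces the input image. Production pipelines typically replace the nearest-neighbor step with interpolation and weight each slice by its CTF.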