Towards Foundation Models for Cryo-ET Subtomogram Analysis

Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang, Xi Xiao, Xiaoyu Cao, Genpei Zhang, Xiao Wang, Xiaolong Wu, Tianyang Wang, Yang Liu, Xingjian Li, Min Xu

TL;DR

This work tackles the core barriers in cryo-ET subtomogram analysis (scarce annotations, severe noise, and domain shifts) by proposing the first foundation-model framework for subtomograms. It combines CryoEngine, a biophysically grounded synthetic data generator that yields over $904{,}000$ subtomograms across $452$ classes, with Adaptive Phase Tokenization ViT (APT-ViT), an SE(3)-aware backbone that uses adaptive phase selection and spherical steerable convolutions to achieve robust translation and rotation equivariance. A Noise-Resilient Contrastive Learning (NRCL) strategy further stabilizes representation learning under extreme noise by using clean-noisy pairings and MoCo-style training. Across 24 synthetic and real datasets, the integrated approach achieves state-of-the-art performance on classification, alignment, and averaging, with strong generalization and effective transfer to real data. The framework thus offers a blueprint for scalable, robust, multi-task cryo-ET foundation models that can accelerate in situ structural discovery in biology.

Abstract

Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.
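
The summary describes NRCL only as contrastive learning over clean-noisy pairings in a MoCo-style setup, and gives no loss formula here. As a rough illustration of that general idea, the sketch below computes a generic InfoNCE loss in which each noisy embedding is pulled towards the embedding of its own clean counterpart and pushed away from the other clean embeddings in the batch. The function name `info_nce`, the temperature value, and the random toy embeddings are assumptions for illustration, not the authors' NRCL implementation.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.2):
    """Generic InfoNCE loss: row i of `queries` should match row i of `keys`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # positives sit on the diagonal

rng = np.random.default_rng(0)
clean_emb = rng.normal(size=(8, 16))               # hypothetical clean-view embeddings
noisy_emb = clean_emb + 0.1 * rng.normal(size=(8, 16))  # hypothetical noisy views

loss_matched = info_nce(noisy_emb, clean_emb)
loss_shuffled = info_nce(noisy_emb, np.roll(clean_emb, 1, axis=0))
print(loss_matched, loss_shuffled)
```

Correctly paired clean-noisy embeddings give a much lower loss than mismatched pairs, which is the signal a contrastive objective of this kind would optimize.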

Paper Structure

This paper contains 58 sections, 5 theorems, 40 equations, 15 figures, 22 tables.

Key Result

Lemma 1

Let $\mathcal{P}$ be the polyphase anchoring operator and let $T_{\mathbf{g}}$ denote the translation operator that shifts $\mathbf{X}$ spatially by $\mathbf{g} = (g_D, g_H, g_W)$. Then there exists a translation $\mathbf{g}' = (g'_D, g'_H, g'_W)$ such that $\mathcal{P}(T_{\mathbf{g}} \mathbf{X}) = T_{\mathbf{g}'} \mathcal{P}(\mathbf{X})$.
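
The property that Lemma 1 describes (anchoring a translated volume yields a translated copy of the anchored volume) can be checked numerically on a toy example. The sketch below makes several assumptions: translations are circular, the volume sides are divisible by the stride `s`, and the phase is picked by maximum energy rather than by the learned selection network the paper describes. The helper names `polyphase_components` and `polyphase_anchor` are hypothetical.

```python
import itertools
import numpy as np

def polyphase_components(x, s):
    """Map each offset (i, j, k) in [0, s)^3 to the phase x[i::s, j::s, k::s]."""
    return {o: x[o[0]::s, o[1]::s, o[2]::s]
            for o in itertools.product(range(s), repeat=3)}

def polyphase_anchor(x, s):
    """Circularly shift x so its max-energy phase lands on offset (0, 0, 0).

    Assumption: energy-based selection stands in for a learned selector.
    Returns the anchored volume and the offset that was selected.
    """
    comps = polyphase_components(x, s)
    best = max(comps, key=lambda o: float(np.sum(comps[o] ** 2)))
    anchored = np.roll(x, shift=tuple(-b for b in best), axis=(0, 1, 2))
    return anchored, best

s = 2
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 8))            # toy volume; sides divisible by s
g = (1, 3, 2)                             # arbitrary circular translation
y = np.roll(x, shift=g, axis=(0, 1, 2))   # T_g X

px, bx = polyphase_anchor(x, s)
py, by = polyphase_anchor(y, s)

# Residual translation g' relating the two anchored volumes.
g_prime = tuple(gi + bxi - byi for gi, bxi, byi in zip(g, bx, by))
matches = np.allclose(py, np.roll(px, shift=g_prime, axis=(0, 1, 2)))
print(g_prime, matches)
```

In this setting each component of `g_prime` is a multiple of the stride, so any subsequent stride-`s` tokenization sees the two anchored volumes as whole-token shifts of each other.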

Figures (15)

  • Figure 1: Foundation model overview. (a) CryoEngine. Atomic structures from the RCSB Protein Data Bank are retrieved, converted to density volumes, and processed by CryoEngine to synthesize subtomograms with ground truth and multi-SNR versions. The composition and visualization of the structural data are shown in the diagram. (b) Model training. The input is a subtomogram pair, one derived from the other via a rigid SE(3) transform $T$. Following the MoCo-style paradigm of MoCo v3, the APT-ViT backbone (Sec. \ref{sec:equi}) encodes them into a latent embedding $z$, which is passed into two branches: one supervised by $T$, the other optimized with NRCL (Sec. \ref{sec:noise}). The total loss combines the two branches. (c) Usage and evaluation. The trained encoder serves as a frozen foundation model for downstream tasks, evaluated on both OOD and real data in Sec. \ref{sec:exp}.
  • Figure 2: CryoEngine architecture. Diverse atomic structures are converted into density maps, embedded into a virtual sample, projected into a tilt series, and reconstructed into a tomogram. Subtomograms of size $32^3$ are then extracted, augmented with calibrated noise, and paired with ground truth.
  • Figure 3: Overview of APT-ViT architecture. Input subtomograms undergo polyphase decomposition to generate multiple phase components with different spatial offsets. A learnable selection network based on 3D spherical steerable convolutions evaluates all phase components and produces selection probabilities to select the optimal phase-aligned tokens. The selected tokens are then processed by transformer blocks to produce refined embeddings for downstream subtomogram analysis tasks.
  • Figure 4: Visualization of alignment results, showing improved structural consistency in both 3D particles (a) and 2D subtomogram slices (b).
  • Figure 5: Diverse 30S ribosomes and corresponding noise-free subtomograms.
  • ...and 10 more figures
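
The "multi-SNR versions" and "calibrated noise" mentioned in the Figure 1 and Figure 2 captions can be illustrated with a small sketch. It assumes SNR is defined as the ratio of signal variance to noise variance and that the noise is additive, zero-mean Gaussian. The function name `add_noise_at_snr`, the toy cube "particle", and the SNR levels are hypothetical and are not taken from the paper.

```python
import numpy as np

def add_noise_at_snr(clean, snr, rng):
    """Add zero-mean Gaussian noise so that var(clean) / var(noise) is about snr."""
    signal_var = float(np.var(clean))
    noise_std = np.sqrt(signal_var / snr)
    noise = rng.normal(0.0, noise_std, size=clean.shape)
    return clean + noise

rng = np.random.default_rng(0)

# Toy 32^3 noise-free subtomogram: a solid cube standing in for a particle.
clean = np.zeros((32, 32, 32), dtype=np.float64)
clean[12:20, 12:20, 12:20] = 1.0

# One noisy copy per (hypothetical) SNR level, each paired with `clean`.
versions = {snr: add_noise_at_snr(clean, snr, rng) for snr in (0.01, 0.05, 0.1)}

for snr, noisy in versions.items():
    empirical = np.var(clean) / np.var(noisy - clean)
    print(f"target SNR {snr}: empirical {empirical:.4f}")
```

Keeping the clean volume alongside every noisy copy is what makes clean-noisy pairs available for pretraining.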

Theorems & Definitions (14)

  • Definition 1: Group Actions and $G$-sets
  • Definition 2: $G$-Equivariance
  • Definition 3: $G$-Invariance
  • Definition 4: Polyphase Decomposition
  • Lemma 1: Polyphase Decomposition Equivariance
  • proof
  • Lemma 2: Equivariance of Adaptive Phase Selection (APS)
  • proof
  • Theorem 3: APT Translation Equivariance
  • proof
  • ...and 4 more