Universally Converging Representations of Matter Across Scientific Foundation Models

Sathya Edamadaka, Soojung Yang, Ju Li, Rafael Gómez-Bombarelli

TL;DR

The paper investigates whether scientific foundation models across molecules, materials, and proteins learn a universal latent representation of matter. By analyzing ~60 diverse models using CKNNA, distance correlation, intrinsic dimensionality, and information imbalance across multiple datasets, the authors show strong representational alignment that increases with model performance, suggesting convergence toward a common physical representation. They also identify two failure regimes (data-limited in-distribution and architecture-dominated out-of-distribution), highlighting current limits of generality and the need for more diverse training data. The work establishes representational alignment as a quantitative benchmark for foundation-level generality and provides practical guidance for model selection and architectural design to maximize transfer across modalities and domains.

Abstract

Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then show two distinct regimes of scientific models: on inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space; on vastly different structures from those seen during training, nearly all models collapse onto a low-information representation, indicating that today's models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models. More broadly, our work can be used to track the emergence of universal representations of matter as models scale, and to select and distill models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.

Paper Structure

This paper contains 38 sections, 19 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: A shows selected model representations for 15,000 structures from each of sAlex, OMat24, QM9, and OMol25, visualized in two dimensions with UMAP (the same plot for all models is shown in Fig. 4). Materials (sAlex and OMat24) embeddings are largely overlapping. Molecule (QM9 and OMol25) embeddings occupy a completely different part of embedding space, falling far outside the distribution of materials embeddings. B shows information imbalance (II) for model embeddings of OMat24 (top row) and OMol25 (bottom row). Embeddings of in-distribution OMat24 structures are significantly more spread out towards the top right, indicating that models represent different information. Embeddings of out-of-distribution OMol25 structures are significantly more clustered towards the bottom left, indicating that models all represent nearly identical information. Each column shows the same information imbalance plots colored by a different representational similarity metric, showing that CKNNA (local) and dCor (global) align very well with information imbalance, while $I_d$Cor (Basile et al., 2025), which quantifies the correlation between different models' intrinsic dimensionality, agrees less. C shows the relationship between the local alignment between models for in-distribution structures (OMat24) versus out-of-distribution structures (OMol25). Although alignment is higher locally for in-distribution structure embeddings, it is higher globally for out-of-distribution structure embeddings (as per information imbalance and the aligned dCor similarities in B, bottom left). D shows that larger training-task (energy prediction) error correlates with structures being out-of-distribution. Each point is an embedding of a structure from OMat24 or OMol25 by Orb V3 Conservative Inf OMat, colored by energy prediction error. The left, bluer (lower-error) cluster consists of OMat24 structure embeddings, and the right, whiter (higher-error) cluster, which extends outside the cluster of in-distribution embeddings, consists of OMol25 structure embeddings.
  • Figure 2: Scientific model representations of 1,000 structures from QM9 converge with increasing performance. Namely, as models' energy regression MAE on small molecules decreases, their representational similarity to the best-performing model (Orb V3 Conservative Inf MP) increases. Each point represents a single model, and its size is proportional to the size of the model. Instead of a local metric like CKNNA, we measure alignment with a global metric called dCor, defined in Section \ref{si:dcor}. This is because small molecules from QM9 are out-of-distribution, and local neighborhoods in representation space of out-of-distribution inputs are not guaranteed to be as meaningful as those of in-distribution inputs like OMat24.
  • Figure 3: CKNNA's sensitivity to $k$, evaluated with the sAlex dataset. A shows that as $k$ increases, CKNNA increases monotonically, preserving the ordering of model alignments almost exactly.
  • Figure 4: Model representations of 15,000 structures from each of sAlex, OMat24, QM9, and OMol25, visualized in two dimensions with UMAP.
  • Figure 5: Full CKNNA correlation matrix ($k=25$) between each model's embeddings of the same 50,000 structures randomly sampled from QM9. A random baseline was also included.
  • ...and 7 more figures
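
The captions above lean on two global comparison metrics: distance correlation (dCor, Figures 1 and 2) and information imbalance (Figure 1B). The sketch below shows one way these quantities could be computed for two embedding matrices of the same structures. It assumes Euclidean distances, the standard biased sample estimator of distance correlation, and the usual rank-based definition of information imbalance; the function names are hypothetical, and the paper's own implementation is not shown here and may differ (for example in distance choice, estimator, or subsampling).

```python
import numpy as np


def _pairwise_dist(X):
    # Euclidean distance matrix between rows of X, shape (n, n).
    # Uses O(n^2 * d) memory, so this is only suitable for small samples.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


def _double_center(D):
    # Subtract row and column means, then add back the grand mean.
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())


def distance_correlation(X, Y):
    """Biased sample distance correlation between embeddings X and Y.

    X and Y hold embeddings of the same n inputs (row i matches row i);
    their feature dimensions may differ.
    """
    A = _double_center(_pairwise_dist(X))
    B = _double_center(_pairwise_dist(Y))
    dcov2_xy = (A * B).mean()
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom <= 0.0:
        return 0.0
    # Clip at zero to guard against tiny negative values from rounding.
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


def information_imbalance(X, Y):
    """Information imbalance Delta(X -> Y).

    For each point, find its nearest neighbor in X and look up that
    neighbor's distance rank in Y. The mean rank is scaled by 2/n, so
    values near 0 mean X predicts Y's neighborhoods and values near 1
    mean X carries little information about Y.
    """
    n = X.shape[0]
    dx = _pairwise_dist(X)
    dy = _pairwise_dist(Y)
    np.fill_diagonal(dx, np.inf)  # exclude self-matches
    np.fill_diagonal(dy, np.inf)
    nn_x = dx.argmin(axis=1)  # nearest neighbor of each point in X
    # Double argsort turns distances into ranks; rank 1 is the nearest.
    ranks_y = dy.argsort(axis=1).argsort(axis=1) + 1
    return float(2.0 / n * ranks_y[np.arange(n), nn_x].mean())
```

Plotting `information_imbalance(X, Y)` against `information_imbalance(Y, X)` for each model pair gives the kind of plane described for Figure 1B: pairs near the bottom left carry nearly equivalent information, and pairs near the top right carry largely different information.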