Boltz is a Strong Baseline for Atom-level Representation Learning

Hyosoon Jang, Hyunjin Seo, Yunhui Jang, Seonghyun Park, Sungsoo Ahn

TL;DR

Boltz2's atom-level representations, learned from protein–ligand co-folding, transfer effectively to standalone small-molecule tasks, challenging the notion that such models rely solely on protein evolutionary signals. The work evaluates Boltz2 on ADMET benchmarks, distills its representations into GruM for diffusion-based generation, and applies it to online structure-guided ligand discovery. It reports competitive property prediction, improved generation quality, and more sample-efficient ligand discovery when generative models are aligned with Boltz2 representations. It also finds that Boltz2 occupies a distinct representation space that can complement existing small-molecule foundation models, and that co-folding pretraining provides transferable chemical physics signals beyond protein contexts. Overall, Boltz emerges as a strong atom-level baseline for small-molecule representation learning and a promising bridge between protein-centric models and small-molecule discovery.

Abstract

Foundation models in molecular learning have advanced along two parallel tracks: protein models, which typically utilize evolutionary information to learn amino acid-level representations for folding, and small-molecule models, which focus on learning atom-level representations for property prediction tasks such as ADMET. Notably, cutting-edge protein-centric models such as Boltz now operate at atom-level granularity for protein-ligand co-folding, yet their atom-level expressiveness for small-molecule tasks remains unexplored. A key open question is whether these protein co-folding models capture transferable chemical physics or rely on protein evolutionary signals, which would limit their utility for small-molecule tasks. In this work, we investigate the quality of Boltz atom-level representations across diverse small-molecule benchmarks. Our results show that Boltz is competitive with specialized baselines on ADMET property prediction tasks and effective for molecular generation and optimization. These findings suggest that the representational capacity of cutting-edge protein-centric models has been underexplored and position Boltz as a strong baseline for atom-level representation learning for small molecules.

Paper Structure

This paper contains 27 sections, 7 equations, 8 figures, 8 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Boltz as an atom-level small-molecule foundation model. We repurpose Boltz, originally trained for protein-ligand co-folding, as a small-molecule representation model by leveraging its atom-level ligand representations.
  • Figure 2: Boltz2 vs. existing foundation models on ADMET benchmarks. As illustrated, Boltz2 shows competitive performance compared to existing foundation models specialized for small molecules on four out of five domains.
  • Figure 3: Representation alignment with foundation models vs. generation quality. Stronger alignment with Boltz2 representations correlates with higher molecular generation quality.
  • Figure 4: Training acceleration using Boltz2. Representation alignment with Boltz2 accelerates training of generative models.
  • Figure 5: Results on structure-guided ligand discovery. The results are averaged over three random seeds. Representation alignment with Boltz2 improves the sample efficiency for discovering high-score molecules that bind to target structures.
  • ...and 3 more figures