Boltz is a Strong Baseline for Atom-level Representation Learning
Hyosoon Jang, Hyunjin Seo, Yunhui Jang, Seonghyun Park, Sungsoo Ahn
TL;DR
Boltz2 demonstrates that atom-level representations learned from protein–ligand co-folding transfer effectively to standalone small-molecule tasks, challenging the notion that such models rely solely on protein evolutionary signals. By evaluating Boltz2 on ADMET benchmarks, distilling its representations into GruM for diffusion-based generation, and applying it to online structure-guided ligand discovery, the work shows competitive property prediction, improved generation quality, and faster discovery with representation alignment. It also reveals that Boltz2 occupies a distinct representation space that can complement existing small-molecule foundation models, and that co-folding pretraining provides transferable chemical physics signals beyond protein contexts. Overall, Boltz emerges as a strong atom-level baseline for small-molecule representation learning and a promising bridge between protein-centric models and small-molecule discovery.
Abstract
Foundation models in molecular learning have advanced along two parallel tracks: protein models, which typically utilize evolutionary information to learn amino acid-level representations for folding, and small-molecule models, which focus on learning atom-level representations for property prediction tasks such as ADMET. Notably, cutting-edge protein-centric models such as Boltz now operate at atom-level granularity for protein-ligand co-folding, yet their atom-level expressiveness for small-molecule tasks remains unexplored. A key open question is whether these protein co-folding models capture transferable chemical physics or instead rely on protein evolutionary signals; reliance on evolutionary signals would limit their utility for small-molecule tasks. In this work, we investigate the quality of Boltz atom-level representations across diverse small-molecule benchmarks. Our results show that Boltz is competitive with specialized baselines on ADMET property prediction tasks and effective for molecular generation and optimization. These findings suggest that the representational capacity of cutting-edge protein-centric models has been underexplored, and they position Boltz as a strong baseline for atom-level representation learning for small molecules.
