Beyond Learning on Molecules by Weakly Supervising on Molecules
Gordan Prastalo, Kevin Maik Jablonka
TL;DR
This work addresses the problem that molecular representations are often not aligned with the downstream task. It proposes ACE-Mol, a 150M-parameter transformer pretrained with weak supervision on hundreds of chemically meaningful motifs described in natural language. By pairing motif-derived properties with task prompts and training with two alternating objectives, $L_{SMILES}$ and $L_{value}$, ACE-Mol learns task-conditioned embeddings that rapidly align with specific chemistry tasks, enabling fast global adaptation and stable local refinement. Across MoleculeNet benchmarks and a synthetic Toxicity Benchmark, ACE-Mol achieves state-of-the-art performance, and its embeddings cluster by functional group and attend to chemically relevant atoms, yielding interpretable representations. The approach highlights soft inductive biases as a scalable alternative to methods that depend heavily on labeled data, with broad implications for adapting foundation models to data-scarce scientific domains via text-conditioned task descriptions and motif-based supervision.
Abstract
Molecular representations are inherently task-dependent, yet most pre-trained molecular encoders are not. Task conditioning promises representations that reorganize based on task descriptions, but existing approaches rely on expensive labeled data. We show that weak supervision on programmatically derived molecular motifs is sufficient. Our Adaptive Chemical Embedding Model (ACE-Mol) learns from hundreds of motifs paired with natural language descriptors that are cheap to compute and trivial to scale. Conventional encoders slowly search the embedding space for task-relevant structure, whereas ACE-Mol immediately aligns its representations with the task. ACE-Mol achieves state-of-the-art performance across molecular property prediction benchmarks with interpretable, chemically meaningful representations.
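
As a rough illustration of the motif-based weak supervision described above, the following minimal Python sketch pairs programmatically derived motif values with natural-language task prompts and alternates between the two objectives. Everything in it is an assumption made for illustration, not the authors' implementation: the `Motif` class, the `build_weak_examples` and `objective_schedule` helpers, the prompt wording, and the strict even/odd alternation are all hypothetical. The motif values are computed with naive substring counts on the SMILES string, a crude stand-in for proper substructure matching with a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class Motif:
    """A hypothetical motif: a text description plus a cheap labeling function."""
    name: str
    description: str
    compute: Callable[[str], float]  # maps a SMILES string to a weak label


def count_token(token: str) -> Callable[[str], float]:
    # ASSUMPTION: naive substring counting stands in for real substructure
    # matching; it is only meant to show that labels need no human annotation.
    return lambda smiles: float(smiles.count(token))


# Two toy motifs; the source describes hundreds of them.
MOTIFS: List[Motif] = [
    Motif("chlorine", "the number of chlorine atoms", count_token("Cl")),
    Motif("double_bond", "the number of explicit double bonds", count_token("=")),
]


def build_weak_examples(smiles: str, motifs: List[Motif] = MOTIFS) -> List[Dict]:
    """Pair each motif's text prompt with its programmatically derived value."""
    return [
        {
            "prompt": f"Predict {m.description} in the molecule.",
            "smiles": smiles,
            "value": m.compute(smiles),
        }
        for m in motifs
    ]


def objective_schedule(n_steps: int) -> Iterator[str]:
    """Alternate the two objectives step by step.

    ASSUMPTION: the source only states that L_SMILES and L_value alternate;
    strict even/odd alternation is one simple way to realize that.
    """
    for step in range(n_steps):
        yield "L_SMILES" if step % 2 == 0 else "L_value"


if __name__ == "__main__":
    for example in build_weak_examples("ClCCl"):  # dichloromethane
        print(example)
    print(list(objective_schedule(4)))
```

Because every label comes from a function of the SMILES string, adding a new motif only requires a description and a labeling function, which is what makes this kind of supervision cheap to compute and easy to scale.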
