Beyond Learning on Molecules by Weakly Supervising on Molecules

Gordan Prastalo, Kevin Maik Jablonka

TL;DR

This work tackles the challenge that molecular representations are often not task-aligned, proposing ACE-Mol, a 150M-parameter transformer pretrained with weak supervision on hundreds of chemically meaningful motifs described in natural language. By pairing motif-derived properties with task prompts and employing two alternating objectives, $L_{SMILES}$ and $L_{value}$, ACE-Mol learns task-conditioned embeddings that rapidly align with specific chemistry tasks, enabling fast global adaptation and stable local refinement. Across MoleculeNet benchmarks and a synthetic Toxicity Benchmark, ACE-Mol achieves state-of-the-art performance and demonstrates embeddings that cluster by functional groups and attend to chemically relevant atoms, yielding interpretable representations. The approach highlights soft inductive biases as a scalable alternative to labeled-data–heavy methods, with broad implications for adapting foundation models to data-scarce scientific domains via text-conditioned task descriptions and motif-based supervision.

Abstract

Molecular representations are inherently task-dependent, yet most pre-trained molecular encoders are not. Task conditioning promises representations that reorganize based on task descriptions, but existing approaches rely on expensive labeled data. We show that weak supervision on programmatically derived molecular motifs is sufficient. Our Adaptive Chemical Embedding Model (ACE-Mol) learns from hundreds of motifs paired with natural language descriptors that are cheap to compute and trivial to scale. Conventional encoders slowly search the embedding space for task-relevant structure, whereas ACE-Mol immediately aligns its representations with the task. ACE-Mol achieves state-of-the-art performance across molecular property prediction benchmarks with interpretable, chemically meaningful representations.
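
To make the idea of weak supervision on programmatically derived motifs concrete, the following is a minimal sketch of how motif values could be paired with natural language task descriptions. It is not the authors' pipeline: the detector functions, the prompt wording, and the `build_weak_examples` helper are all hypothetical, and the motif counting is a toy string heuristic that assumes simple organic-subset SMILES (a real pipeline would use substructure matching from a cheminformatics toolkit).

```python
from typing import Callable, Dict, List

# Toy motif detectors operating directly on SMILES strings.
# Assumption: inputs are simple organic-subset SMILES; metals such as "Fe"
# or "Sn" would be miscounted by these heuristics.
def count_halogens(smiles: str) -> int:
    """Count F, Cl, Br and I atoms with a crude substring heuristic."""
    return (smiles.count("Cl") + smiles.count("Br")
            + smiles.count("F") + smiles.count("I"))


def count_aromatic_atoms(smiles: str) -> int:
    """Count aromatic atoms, which SMILES writes in lowercase (c, n, o, s)."""
    return sum(ch in "cnos" for ch in smiles)


# Hypothetical natural language descriptions, one per motif.
MOTIFS: Dict[str, Callable[[str], int]] = {
    "number of halogen atoms": count_halogens,
    "number of aromatic atoms": count_aromatic_atoms,
}


def build_weak_examples(smiles_list: List[str]) -> List[dict]:
    """Pair every molecule with every motif description and its computed value."""
    examples = []
    for smiles in smiles_list:
        for description, detector in MOTIFS.items():
            examples.append({
                "prompt": f"Predict the {description}.",  # hypothetical wording
                "smiles": smiles,
                "value": detector(smiles),  # weak label, no experiment needed
            })
    return examples


if __name__ == "__main__":
    for example in build_weak_examples(["c1ccccc1Cl", "CCO"]):
        print(example)
```

Because every label comes from a deterministic function of the molecule, adding a new motif only requires one more detector and one more description, which is what makes this kind of supervision cheap to scale.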

Paper Structure (71 sections, 3 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: ACE-Mol adapts to the task space. ACE-Mol uses task conditioning to adapt to the task-specific subspace. Previous models rely solely on labels to reorganize the embedding space. ACE-Mol's embeddings are inherently task adaptive; previous approaches' embeddings are static.
  • Figure 2: ACE-Mol's representations are task-specific. ACE-Mol learns to associate molecules based on their shared properties. The embedded molecules $M = [m_1, m_2, m_3, m_4]$ are associated based on their shared property value $y$ within the subspace of each task in $T$. As a result, molecules such as $m_1$ and $m_2$ can be close in the embedding space for task $t_1$ while being farther apart for tasks $t_2$ and $t_3$.
  • Figure 3: ACE-Mol's embeddings cluster based on functional groups. Representations are extracted from the hold-out test set molecules (scaffold-split) used to pretrain ACE-Mol. The baseline models are MolFormer [molformer], MolT5 [edwards-etal-2022-translation], and ChemBERTa [ahmad2022chemberta020]. The target classes correspond to the presence of functional groups.
  • Figure 4: Global centroid movement. Comparison of the normalized embedding centroid movement of ACE-Mol and ChemBERTaV2, computed as the embedding centroid distance between models fine-tuned with $N$ and $N-1$ data points. The reported distance is the mean across all toxicity benchmark tasks, for both the toxic and non-toxic centroids.
  • Figure 5: Local embedding change. Recall of local embedding neighbourhoods for fine-tuned ACE-Mol and ChemBERTaV2 models. The reported recall compares the $k=5$ nearest-neighbour neighbourhoods of the model fine-tuned with $N$ data points against those of the pretrained model, averaged over all neighbourhoods and all 14 tasks of the toxicity benchmark.
  • ...and 7 more figures
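
The two embedding diagnostics described in the captions of Figures 4 and 5 can be sketched roughly as follows. This is an illustrative reading, not the paper's evaluation code: the function names are hypothetical, the normalization mentioned for Figure 4 is omitted because the captions do not specify it, and neighbourhood recall is assumed here to mean the fraction of a point's $k$ nearest neighbours that are preserved between two embedding spaces.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def centroid(embeddings: Sequence[Vector]) -> List[float]:
    """Mean vector of a set of embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def centroid_shift(before: Sequence[Vector], after: Sequence[Vector]) -> float:
    """Euclidean distance between the centroids of two embedding sets,
    e.g. one class embedded by models fine-tuned on N-1 and N data points."""
    return math.dist(centroid(before), centroid(after))


def knn_indices(embeddings: Sequence[Vector], i: int, k: int) -> Set[int]:
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(embeddings)) if j != i]
    others.sort(key=lambda j: math.dist(embeddings[i], embeddings[j]))
    return set(others[:k])


def neighbourhood_recall(reference: Sequence[Vector],
                         finetuned: Sequence[Vector],
                         k: int = 5) -> float:
    """Mean fraction of each point's reference k-NN set that is still in its
    k-NN set after fine-tuning (assumed definition of recall)."""
    recalls = []
    for i in range(len(reference)):
        kept = knn_indices(reference, i, k) & knn_indices(finetuned, i, k)
        recalls.append(len(kept) / k)
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    pretrained = [[0.0], [1.0], [10.0], [11.0]]
    shuffled = [[0.0], [10.0], [1.0], [11.0]]
    print(centroid_shift([[0.0, 0.0], [2.0, 0.0]], [[0.0, 3.0], [2.0, 3.0]]))
    print(neighbourhood_recall(pretrained, pretrained, k=1))
    print(neighbourhood_recall(pretrained, shuffled, k=1))
```

Under this reading, a large centroid shift combined with high neighbourhood recall would correspond to the "fast global adaptation and stable local refinement" behaviour described in the TL;DR: the class as a whole moves, while each molecule keeps its nearest neighbours.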