Look the Other Way: Designing 'Positive' Molecules with Negative Data via Task Arithmetic

Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni

TL;DR

Molecular Task Arithmetic (MTA) is a transfer-learning strategy for de novo molecule design that is driven by negative data: it learns a task direction from undesirable molecules, then moves the model in the opposite direction in weight space to generate positives. Across 33 design tasks, including single- and dual-objective ligand design, docking to multiple targets, and protein design, MTA consistently enhances design diversity and often outperforms positive-data finetuning, while preserving SMILES validity. The approach enables zero-shot design, improves few-shot performance when combined with limited positive data, and scales to complex objectives like protein design, suggesting it could become a foundational transfer-learning paradigm in drug discovery. The work also analyzes behavior under distribution shifts and outlines practical guidelines for tuning the task vector magnitude, highlighting robustness and limitations. Overall, MTA offers a data-efficient, versatile framework that leverages abundant negative data to expand chemical space exploration with high-quality, diverse hits.

Abstract

The scarcity of molecules with desirable properties (i.e., 'positive' molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn 'property directions' (without accessing any positively labeled data) and moving models in the opposite property directions to generate positive molecules. When analyzed on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales, molecular task arithmetic generally generated more diverse and successful designs than models trained on positive molecules. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable complex design properties, such as good docking scores to a protein. With its simplicity, data efficiency, and performance, molecular task arithmetic has the potential to become the de facto transfer learning strategy for de novo molecule design.
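
The idea of learning a property direction from negatives and moving against it can be written compactly in standard task-arithmetic notation. This is only a sketch: the symbols $\theta_{\text{pre}}$ (pretrained weights), $\theta_{\text{neg}}$ (weights after finetuning on negative molecules), and $\tau$ (task vector) are assumed here for illustration and may not match the paper's own equation; $\lambda$ denotes the task vector scaling factor.

$$\tau = \theta_{\text{neg}} - \theta_{\text{pre}}, \qquad \theta_{\text{MTA}} = \theta_{\text{pre}} - \lambda\,\tau$$

With $\lambda = 0$ the pretrained model is recovered unchanged; larger $\lambda$ moves the weights further away from the negative-finetuned model.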

Paper Structure

This paper contains 23 sections, 4 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Molecular task arithmetic (MTA). (a) Zero-shot design. Molecular task arithmetic learns a task direction in the model weight space by finetuning on negative molecules (dashed purple arrow). The task vector is traversed in the opposite direction (solid orange arrow). (b) Multi-objective design. Multiple task directions are learned independently, and then combined. (c) Few-shot design. Task arithmetic is applied on the model finetuned with negatives, and then known positive molecules are used (dashed orange arrow).
  • Figure 2: Zero-shot single-objective molecule design. Molecular task arithmetic vs fine-tuning, across 10 design tasks (100,000 molecular designs). The pretrained model is included as a baseline, and average and standard deviation are reported. (a) number of cluster centers that possess the desired property; (b) ratio of designs that satisfy the design task; (c) number of clusters.
  • Figure 3: Out-of-distribution design. The number of successful design clusters was computed at increasing sizes of 'negative' and 'positive' training sets. Mean (solid lines) and standard deviation (shaded areas) are reported (100,000 designs, five training-validation splits).
  • Figure 4: Zero-shot dual-objective molecule design. Models are trained on randomized SMILES representations with sequential supervised finetuning and molecular task arithmetic to design molecules that possess two task properties simultaneously (average and standard deviation across five splits).
  • Figure 5: Task vector scaling, across 19 increasing scaling factors ($\lambda$; Eq. \ref{eq:molecular-ta}). For each $\lambda$, validity and success rate for 100,000 designs were computed (average and standard deviation; five training splits).
  • ...and 17 more figures
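
The weight-space operations described in the Figure 1 and Figure 5 captions can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `task_vector` and `apply_task_vectors`, the toy weight dictionaries, and the choice to combine multiple task directions by summing them are all assumptions made for illustration. Only the general recipe (finetune on negatives, take the weight difference, move the pretrained weights in the opposite direction scaled by a factor `lam`) follows the captions.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """Per-parameter difference: finetuned weights minus pretrained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def apply_task_vectors(pretrained, vectors, lam):
    """Move the pretrained weights AGAINST the summed task vectors, scaled by lam.

    Summing the vectors is an assumed way of combining task directions.
    """
    moved = {}
    for name, weight in pretrained.items():
        total = sum(v[name] for v in vectors)
        moved[name] = weight - lam * total
    return moved

# Toy weights standing in for a real model's parameters (hypothetical values).
pretrained = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
negative_ft_a = {"w": np.array([1.5, 1.0]), "b": np.array([0.7])}  # finetuned on negatives, task A
negative_ft_b = {"w": np.array([1.0, 2.5]), "b": np.array([0.5])}  # finetuned on negatives, task B

tau_a = task_vector(pretrained, negative_ft_a)
tau_b = task_vector(pretrained, negative_ft_b)

# Zero-shot, single objective: step away from the negative direction of task A.
zero_shot = apply_task_vectors(pretrained, [tau_a], lam=1.0)

# Dual objective: step away from both negative directions at once.
dual = apply_task_vectors(pretrained, [tau_a, tau_b], lam=0.5)
```

In a real setting the dictionaries would be replaced by a model's parameter dictionary, and `lam` would be chosen by checking validity and success rate of the generated designs at several values, as the Figure 5 caption describes.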