Look the Other Way: Designing 'Positive' Molecules with Negative Data via Task Arithmetic
Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni
TL;DR
Molecular Task Arithmetic (MTA) is a transfer-learning strategy for de novo molecule design that is driven by negative data: it learns a task direction from undesirable molecules and moves the model in the opposite direction in weight space to generate positives. Across 33 design tasks, including single- and dual-objective ligand design, docking to multiple targets, and protein design, MTA consistently enhances design diversity and often outperforms positive-data finetuning, while preserving SMILES validity. The approach enables zero-shot design, improves few-shot performance when combined with limited positive data, and scales to complex objectives like protein design, suggesting it could become a foundational transfer-learning paradigm in drug discovery. The work also analyzes behavior under distribution shifts and outlines practical guidelines for tuning the task vector magnitude, highlighting robustness and limitations. Overall, MTA offers a data-efficient, versatile framework that leverages abundant negative data to expand chemical space exploration with high-quality, diverse hits.
Abstract
The scarcity of molecules with desirable properties (i.e., 'positive' molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn 'property directions', without accessing any positively labeled data, and then moving the model in the opposite property directions to generate positive molecules. When analyzed on 33 design experiments with distinct molecular entities (small molecules, proteins), model architectures, and scales, molecular task arithmetic generally generated more diverse and successful designs than models trained on positive molecules. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable complex design properties, such as good docking scores to a protein. With its simplicity, data efficiency, and performance, molecular task arithmetic has the potential to become the de facto transfer learning strategy for de novo molecule design.
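To make the idea of a 'property direction' concrete, the following is a minimal sketch of weight-space negation as it is commonly formulated in task arithmetic: the direction is taken as the difference between weights finetuned on negative data and the pretrained weights, and the edited model subtracts a scaled copy of that direction. The flat dictionary representation of weights, the function names, the toy values, and the `scale` parameter are all illustrative assumptions and are not taken from the paper's implementation.

```python
from typing import Dict, List

# Hypothetical flat representation of model parameters: name -> list of values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned_on_negatives: Weights) -> Weights:
    """Property direction: weights finetuned on negatives minus pretrained weights."""
    return {
        name: [f - p for f, p in zip(finetuned_on_negatives[name], pretrained[name])]
        for name in pretrained
    }


def negate_and_apply(pretrained: Weights, direction: Weights, scale: float) -> Weights:
    """Move the pretrained weights in the direction opposite to the learned one."""
    return {
        name: [p - scale * d for p, d in zip(pretrained[name], direction[name])]
        for name in pretrained
    }


# Toy weights standing in for a pretrained model and a copy finetuned on negatives.
pretrained = {"layer.weight": [0.5, -1.0, 2.0], "layer.bias": [0.0]}
negative_finetuned = {"layer.weight": [0.75, -1.5, 2.0], "layer.bias": [0.25]}

direction = task_vector(pretrained, negative_finetuned)
edited = negate_and_apply(pretrained, direction, scale=1.0)
```

In this sketch, `scale` plays the role of the task vector magnitude: larger values push the model further away from the negative-data direction, and in practice it would be tuned per task.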
