AssayMatch: Learning to Select Data for Molecular Activity Models

Vincent Fan, Regina Barzilay

TL;DR

AssayMatch tackles the problem of noisy, heterogeneous assay data by learning to select training assays most compatible with a given test assay without requiring test labels. It computes per-assay TRAK data-attribution scores, finetunes language embeddings of assay descriptions with a contrastive objective based on those scores, and ranks training data to maximize transfer to unseen assays. Across six ChEMBL IC50 targets and two model architectures, AssayMatch demonstrates improved data efficiency and predictive performance, often surpassing models trained on the full dataset and particularly excelling in low-data regimes. This data-driven data curation approach reduces noise from incompatible experiments and offers a scalable pathway for more reliable drug-discovery modeling.

Abstract

The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogeneous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to finetune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings finetuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch can surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and observe improved predictive performance over a strong language-only baseline for 9 of 12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.
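
The test-time ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each assay description has already been embedded as a vector and that training assays are ranked by cosine similarity to the test assay's embedding; the abstract only states that the finetuned embeddings are used to rank training data, so the similarity measure, the fraction-based budget, the function names, and the toy vectors below are all illustrative assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_training_assays(test_embedding, train_embeddings):
    """Return training assay ids sorted from most to least similar.

    Only the test assay's description embedding is needed here; no test
    labels are used, which mirrors the setting described in the abstract.
    """
    scores = {
        assay_id: cosine(test_embedding, emb)
        for assay_id, emb in train_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def select_top_fraction(ranked_ids, fraction):
    """Keep the top `fraction` of ranked assays (at least one)."""
    k = max(1, math.ceil(fraction * len(ranked_ids)))
    return ranked_ids[:k]


# Toy 3-dimensional embeddings (hypothetical values, not real assay data).
train_embeddings = {
    "assay_a": [0.9, 0.1, 0.0],
    "assay_b": [0.1, 0.9, 0.1],
    "assay_c": [0.8, 0.2, 0.1],
    "assay_d": [0.0, 0.1, 0.9],
}
test_embedding = [1.0, 0.0, 0.0]

ranked = rank_training_assays(test_embedding, train_embeddings)
selected = select_top_fraction(ranked, 0.5)
print(ranked)    # ['assay_a', 'assay_c', 'assay_b', 'assay_d']
print(selected)  # ['assay_a', 'assay_c']
```

In this sketch, a model would then be trained only on the molecules measured in the selected assays. Varying `fraction` is one way to trace out a learning curve over training-set size.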

Paper Structure

This paper contains 13 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: CYP3A4 IC50 measurements for the same molecule taken from ChEMBL assays 2296579, 2296580, and 2296590. The measurements differ vastly depending on specific details in the assay description, such as the probe substrate or timing conditions.
  • Figure 2: Overview of AssayMatch. First, per-assay TRAK scores are computed to elucidate assay-to-assay relationships. Next, language embeddings of experimental descriptions are finetuned to reflect these relationships. Lastly, the finetuned embeddings are used to rank data with respect to unseen assay descriptions at test time.
  • Figure 3: To evaluate AssayMatch, we randomly select 10 assay descriptions for each target. AssayMatch selects a separate dataset for each test assay, on which we train a separate model. The predictions are aggregated and scored by AUROC to construct a learning curve.
  • Figure 4: Microaveraged AUC scores of Chemprop and SMILES Transformer trained on subsets of different sizes, selected by a random baseline, the original language embeddings, and AssayMatch embeddings. The performance of the model trained on the full available dataset corresponds to size = 100%. The performance obtained by selecting all training assays that share the same BioAssay Ontology (BAO) label as the test assay is represented by the red dashed line. Results for each individual target (averaged over both architectures) are displayed below. AssayMatch is the best strategy for 4/6 targets.
  • Figure 5: A: PCA of assay description embeddings from the text-embedding-004 model on all ChEMBL assay descriptions for CYP3A4. Descriptions including similar keywords are clustered together and highlighted. B: Aggregated pairwise TRAK scores between $k$-means clusters of assay embeddings. Intra-cluster TRAK scores (along the main diagonal) are significantly higher.
  • ...and 2 more figures
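
The aggregation step mentioned in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal illustration that assumes "microaveraged" means pooling the predictions of all per-test-assay models into one list before computing a single AUROC; the captions do not spell out the exact procedure, and the function names and toy numbers are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def micro_averaged_auroc(per_assay_predictions):
    """Pool (labels, scores) from every test assay, then score once."""
    labels, scores = [], []
    for assay_labels, assay_scores in per_assay_predictions.values():
        labels.extend(assay_labels)
        scores.extend(assay_scores)
    return auroc(labels, scores)


# Hypothetical predictions from two separately trained models.
per_assay_predictions = {
    "test_assay_1": ([1, 0, 1], [0.9, 0.2, 0.6]),
    "test_assay_2": ([0, 1, 0], [0.7, 0.8, 0.1]),
}

pooled = micro_averaged_auroc(per_assay_predictions)
per_assay = {
    name: auroc(labels, scores)
    for name, (labels, scores) in per_assay_predictions.items()
}
print(round(pooled, 3))  # 0.889
print(per_assay)         # {'test_assay_1': 1.0, 'test_assay_2': 1.0}
```

In this toy case each model ranks its own assay perfectly, yet the pooled score is 8/9 because scores from different models are compared directly. Under the pooling assumption, a micro-averaged metric is therefore also sensitive to how consistently the separate models are calibrated.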