Robust Molecular Property Prediction via Densifying Scarce Labeled Data

Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

TL;DR

The paper tackles covariate shift and scarce labeled data in molecular property prediction with a bilevel optimization framework that densifies the training distribution by interpolating labeled points with abundant unlabeled context through a learnable set function $\mu_\lambda$. The inner loop updates a meta-learner on the densified training signal, while the outer loop optimizes the mixer parameters using hypergradients derived from a meta-validation objective, with pseudo-labels on unlabeled data serving as a regularizer. Empirically, the method is reported to yield significant MSE improvements over baselines on the Merck Molecular Activity Challenge datasets (HIVPROT, DPP4, NK1). These results are accompanied by t-SNE analyses showing clearer separation between joint, input, context, and OOD embeddings, which the authors present as evidence of robust generalization under covariate shift. The work demonstrates a practical route to extrapolating to unseen chemical spaces in drug discovery and provides code for replication.
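The inner/outer structure summarized above can be illustrated with a small toy sketch. This is not the authors' implementation: the scalar linear model, the sigmoid-weighted mean used in place of the set function $\mu_\lambda$, the one-step inner update, and the finite-difference hypergradient are all assumptions chosen to keep the example dependency-free, and pseudo-label regularization is omitted. Every function name here is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def densify(x, context, lam):
    """Toy stand-in for the set function: mix x with the mean of the context set."""
    alpha = sigmoid(lam)  # mixing weight in (0, 1), controlled by the mixer parameter
    c_mean = sum(context) / len(context)
    return (1.0 - alpha) * x + alpha * c_mean


def mse(theta, xs, ys):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def inner_step(theta, lam, train, context, lr=0.05):
    """Inner loop: one gradient step on theta using densified training inputs."""
    xs = [densify(x, context, lam) for x, _ in train]
    ys = [y for _, y in train]
    grad = sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return theta - lr * grad


def outer_loss(theta, lam, train, context, mvalid):
    """Meta-validation loss measured after a one-step inner update."""
    theta_new = inner_step(theta, lam, train, context)
    return mse(theta_new, [x for x, _ in mvalid], [y for _, y in mvalid])


def hypergradient(theta, lam, train, context, mvalid, eps=1e-4):
    """Assumption: central finite difference instead of an analytic hypergradient."""
    up = outer_loss(theta, lam + eps, train, context, mvalid)
    down = outer_loss(theta, lam - eps, train, context, mvalid)
    return (up - down) / (2.0 * eps)


def run(steps=200, seed=0):
    rng = random.Random(seed)
    true_w = 2.0
    # Labeled training inputs live in [0, 1]; meta-validation inputs are shifted to [2, 3].
    train = [(x, true_w * x) for x in (rng.uniform(0.0, 1.0) for _ in range(8))]
    context = [rng.uniform(1.0, 3.0) for _ in range(32)]  # unlabeled context inputs
    mvalid = [(x, true_w * x) for x in (rng.uniform(2.0, 3.0) for _ in range(8))]
    theta, lam = 0.0, 0.0
    history = []
    for _ in range(steps):
        # Outer loop: update the mixer parameter from the meta-validation signal.
        lam -= 0.1 * hypergradient(theta, lam, train, context, mvalid)
        # Inner loop: update the learner on the densified training data.
        theta = inner_step(theta, lam, train, context)
        history.append(mse(theta, [x for x, _ in mvalid], [y for _, y in mvalid]))
    return theta, lam, history
```

Calling `run()` returns the final learner parameter, the final mixer parameter, and the meta-validation loss after each step. The finite difference only makes the dependence of the outer loss on the mixer explicit; a practical implementation would more likely use automatic differentiation or an implicit-gradient approximation.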

Abstract

A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, which stems from the onerous and costly nature of experimental validation, exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.

Paper Structure

This paper contains 14 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Concept. We densify the train dataset using external unlabeled data (context points) for robust generalization across covariate shift. Notation details are provided in the analysis section.
  • Figure 2: t-SNE visualization of DPP4 (bit) dataset from the penultimate layer across different methods. All models were trained on $\mathcal{D}_\text{train}$, $\mathcal{D}_\text{context}$, and $\mathcal{D}_\text{mvalid}$. At test time, we evaluate each model on four input variants (orange, blue, green, purple) to analyze how the model utilizes $\mathcal{D}_\text{context}$ to achieve robustness under covariate shift and how it behaves on out-of-distribution (OOD) data.
  • Figure 3: Overview of our proposed model. (a) During training, the model interpolates between a labeled train point $(x_i, y_i)$ and context point $C_i$ to learn robust representations. At test time, the model predicts on an OOD input using the learned meta learner $f_\theta$ and set function $\mu_\lambda$. (b) The model is trained via bilevel optimization, where the inner loop updates $\theta$ using the inner loss $L_{\text{inner}}$, while the outer loop updates $\lambda$ using the hypergradient computed from $L_T$ and $L_V$.
  • Figure 4: t-SNE visualization of the model trained on the HIVPROT (count) dataset
  • Figure 5: t-SNE visualization of the model trained on the HIVPROT (bit) dataset
  • ...and 4 more figures
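
The bilevel training described in the Figure 3 caption has the general shape of a standard bilevel program. The following is the textbook form, written with the caption's symbols; it is a sketch, not the paper's own equations. It assumes that $L_T$ is the training loss whose stationarity defines the inner solution and that the inner Hessian is invertible; how $L_T$ relates to $L_{\text{inner}}$ is not stated in this excerpt.

$$
\theta^{*}(\lambda) = \arg\min_{\theta} L_{\text{inner}}(\theta, \lambda), \qquad \lambda^{*} = \arg\min_{\lambda} L_V\big(\theta^{*}(\lambda), \lambda\big)
$$

$$
\frac{d L_V}{d\lambda} = \frac{\partial L_V}{\partial \lambda} - \frac{\partial^{2} L_T}{\partial \lambda\, \partial \theta^{\top}} \left( \frac{\partial^{2} L_T}{\partial \theta\, \partial \theta^{\top}} \right)^{-1} \frac{\partial L_V}{\partial \theta}
$$

The second line follows from differentiating the inner stationarity condition $\partial L_T / \partial \theta = 0$ with respect to $\lambda$ (the implicit function theorem). This is one way a hypergradient can depend on both $L_T$ and $L_V$, as the caption states.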