Few-shot Molecular Property Prediction: A Survey

Zeyu Wang, Tianyi Jiang, Huanchang Ma, Yao Lu, Xiaoze Bao, Shanqing Yu, Qi Xuan, Shirui Pan, Xin Zheng

TL;DR

Few-shot molecular property prediction (FSMPP) tackles predicting molecular properties under scarce annotations, a common bottleneck in drug discovery. The paper presents the first comprehensive survey, introducing a unified taxonomy across data-level, model-level, and learning-paradigm methods, and reviews representative approaches, datasets, and evaluation protocols. It identifies two core generalization challenges—cross-property distribution shifts and cross-molecule heterogeneity—and highlights current trends toward data- and model-centric strategies with emerging hybrid approaches. The analysis emphasizes the practical impact of FSMPP for rapid, resource-efficient molecular design and outlines opportunities in theory, multi-modal knowledge integration, scalability, and interpretability to guide future research and real-world pipelines.

Abstract

AI-assisted molecular property prediction has become a promising technique in early-stage drug discovery and materials design in recent years. However, because wet-lab experiments are costly and complex, real-world molecules usually suffer from scarce annotations, leaving limited labeled data for effective supervised AI model learning. In light of this, few-shot molecular property prediction (FSMPP) has emerged as an expressive paradigm that enables learning from only a few labeled examples. Despite rapidly growing attention, existing FSMPP studies remain fragmented, without a coherent framework to capture methodological advances and domain-specific challenges. In this work, we present the first comprehensive and systematic survey of few-shot molecular property prediction. We begin by analyzing the few-shot phenomenon in molecular datasets and highlighting two core challenges: (1) cross-property generalization under distribution shifts, where each task, corresponding to one property, may follow a different data distribution or even be inherently weakly related to the others from a biochemical perspective, requiring the model to transfer knowledge across heterogeneous prediction tasks, and (2) cross-molecule generalization under structural heterogeneity, where molecules involved in different properties or in the same property may exhibit significant structural diversity, making it difficult for the model to generalize. Then, we introduce a unified taxonomy that organizes existing methods into data, model, and learning paradigm levels, reflecting their strategies for extracting knowledge from scarce supervision in few-shot molecular property prediction. Next, we compare representative methods and summarize benchmark datasets and evaluation protocols. Finally, we identify key trends and future directions for advancing research on FSMPP.

Paper Structure

This paper contains 30 sections, 11 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Difference between MPP and FSMPP, where $y_i$ is the prediction for molecule $x_i$, $y_{i,\tau}$ is the prediction under task $\tau$, and $S_\tau$ and $Q_\tau$ denote the support set and query set of task $\tau$. With $|D_{\text{train}}| \gg |S_\tau|$, FSMPP requires generalization across both molecules and tasks under very limited supervision.
  • Figure 2: Data statistics of ChEMBL. (A) Distribution of activity annotations per target. The x-axis shows the number of activity annotations (log scale), and the y-axis indicates the number of targets, where a target refers to a biomolecular entity (such as a protein) that can be regarded as a specific prediction task. Two versions are shown: all activities (blue), and activities after removing outliers/duplicates and grouping by the main activity type of each target (purple). (B) IC50 distribution of the top-5 targets with the most annotations. The violin plots illustrate the spread and density of IC50 values for each target, where IC50 is a widely used pharmacological indicator of compound activity (lower values indicate stronger activity). IC50 10 is the binary threshold.
  • Figure 3: The systematic taxonomy of existing FSMPP methods.
  • Figure 4: Analysis of property distributions and molecular structural similarity. (A) Heatmaps showing Pearson correlation coefficients between molecular properties in QM9 and Alchemy. (B) Histogram and density plots of pairwise cosine similarity of molecular fingerprints in SIDER and Tox21.
  • Figure 5: Solving FSMPP problems by generative molecule data augmentation. This framework shows both how to generate new samples and how to leverage them effectively. Sample generation strategies include: (1) structure modification, such as bond swapping, atom dropping, and substructure replacement; (2) mixup in feature or structure space, which blends molecular features or graph structures; and (3) mask-based modification, where atoms, bonds, or substructures are masked to construct auxiliary learning signals. The generated samples are used through two learning paradigms: (1) a supervised route, where augmented data is added directly to the training set to alleviate data scarcity; and (2) a self-supervised route, where tasks such as molecule completion or contrastive learning are introduced to enhance models.
  • ...and 2 more figures
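
The support set $S_\tau$ and query set $Q_\tau$ described in the Figure 1 caption can be illustrated with a small episode sampler. The following is a minimal sketch, not the survey's or any benchmark's implementation: it assumes a binary property per task, a balanced K-shot support set (K positives and K negatives), and molecules given as SMILES strings. The function name `sample_episode` and its parameters are hypothetical.

```python
import random
from typing import List, Tuple

# Assumption: a molecule is a (SMILES string, binary label) pair.
Molecule = Tuple[str, int]


def sample_episode(
    task_data: List[Molecule],
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Molecule], List[Molecule]]:
    """Split one task's labeled molecules into a support set and a query set.

    The support set holds k_shot positives and k_shot negatives; the query
    set holds up to n_query of the remaining molecules, drawn at random.
    """
    # List comprehensions build new lists, so task_data is not mutated.
    positives = [m for m in task_data if m[1] == 1]
    negatives = [m for m in task_data if m[1] == 0]
    if len(positives) < k_shot or len(negatives) < k_shot:
        raise ValueError("not enough labeled molecules for a k-shot support set")

    rng.shuffle(positives)
    rng.shuffle(negatives)

    support = positives[:k_shot] + negatives[:k_shot]
    remaining = positives[k_shot:] + negatives[k_shot:]
    rng.shuffle(remaining)
    query = remaining[:n_query]
    return support, query
```

In this sketch a model would adapt on `support` and be evaluated on `query`, and the procedure would be repeated per task $\tau$; how episodes are actually drawn varies between methods and benchmarks.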