Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints
Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
TL;DR
GAMIC advances molecular ICL by aligning graph-based molecular representations with text captions, using Morgan fingerprint sampling and contrastive learning for graph–language alignment and MMR-based retrieval to select diverse demonstrations. The framework targets small-to-mid-sized LLMs and avoids heavy fine-tuning while delivering strong performance on molecule captioning, property prediction, and yield prediction. Empirical results across multiple datasets and LLMs show GAMIC improving on Morgan-based baselines by up to 45%, establishing a practical approach to graph-aware multimodal ICL in chemistry. This broadens the applicability of ICL to molecular tasks and supports more scalable, flexible deployment in drug discovery and materials science.
Abstract
In-context learning (ICL) effectively conditions large language models (LLMs) for molecular tasks, such as property prediction and molecule captioning, by embedding carefully selected demonstration examples into the input prompt. This approach avoids the computational overhead of extensive pretraining and fine-tuning. However, current prompt retrieval methods for molecular tasks rely on molecule feature similarity, such as Morgan fingerprints, which does not adequately capture global molecular and atom-binding relationships. As a result, these methods fail to represent the full complexity of molecular structures during inference. Moreover, small-to-medium-sized LLMs, which offer simpler deployment requirements in specialized systems, have remained largely unexplored in the molecular ICL literature. To address these gaps, we propose a self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context learning), which aligns global molecular structures, represented by graph neural networks (GNNs), with textual captions (descriptions) while leveraging local feature similarity through Morgan fingerprints. In addition, we introduce a Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to optimize the demonstration samples placed in the input prompt. Our experimental findings on diverse benchmark datasets show that GAMIC outperforms simple Morgan-based ICL retrieval methods across all tasks by up to 45%.
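To make the MMR-based diversity heuristic concrete, the following is a minimal sketch of greedy MMR selection over a pool of candidate demonstrations. It is an illustration under stated assumptions, not the GAMIC implementation: fingerprints are modeled as plain sets of "on" bit indices, Tanimoto similarity stands in for whatever similarity the retriever actually uses (in GAMIC this would presumably be computed on learned graph embeddings), and the function names and the trade-off parameter `lam` are hypothetical.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def mmr_select(
    query: Set[int],
    pool: Sequence[Set[int]],
    k: int,
    lam: float = 0.7,  # assumed trade-off: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Greedy MMR: return indices of up to k pool items, in selection order.

    Each step picks the candidate maximizing
        lam * sim(query, candidate) - (1 - lam) * max sim(candidate, already selected).
    """
    relevance = [tanimoto(query, fp) for fp in pool]
    selected: List[int] = []
    remaining = list(range(len(pool)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (tanimoto(pool[i], pool[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

With `lam = 1.0` the procedure reduces to plain top-k similarity retrieval, so near-duplicate demonstrations can fill the prompt. Lowering `lam` penalizes candidates that resemble demonstrations already selected, which is the diversity effect the abstract describes.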
