Chemical Language Models for Natural Products: A State-Space Model Approach
Ho-Hsuan Wang, Afnan Sultan, Andrea Volkamer, Dietrich Klakow
TL;DR
The paper evaluates chemical language models (CLMs) for natural products (NPs) by comparing selective state-space models (Mamba, Mamba-2) with a GPT transformer baseline on a corpus of about 1M NP SMILES, across eight tokenizers and two downstream tasks (molecule generation and property prediction). Smaller, chemistry-informed tokenizers generally outperform large-vocabulary options. In generation, Mamba delivers higher validity and structural consistency with fewer long-range errors, while GPT achieves greater novelty. Property-prediction performance depends strongly on tokenizer choice and data-splitting strategy, and NP-specific pre-training from scratch can approach or match models pre-trained on datasets over 100 times larger. Overall, the results highlight the value of data relevance and tokenizer design in NP CLMs and suggest broader implications for dense symbolic sequence modeling beyond natural products.
Abstract
Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet natural products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation by validity, uniqueness, and novelty, and property prediction (membrane permeability, taste, anti-cancer activity) by Matthews correlation coefficient (MCC) and area under the ROC curve (AUC-ROC). Mamba generates 1-2 percent more valid and unique molecules than Mamba-2 and GPT, with fewer long-range dependency errors, while GPT yields slightly more novel structures. For property prediction, Mamba variants outperform GPT by 0.02-0.04 MCC under random splits, while performance is comparable under scaffold splits. The results demonstrate that domain-specific pre-training on about 1M NPs can match models trained on datasets over 100 times larger.
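As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes validity, uniqueness, and novelty for a set of generated SMILES strings, and MCC from a binary confusion matrix. It uses common definitions of these metrics, which may differ in detail from those used in the paper. The function names (`generation_metrics`, `mcc`, `balanced_parens`) are hypothetical, and the validity check is a toy stand-in that only tests parenthesis balance. In practice one would likely parse and canonicalize each string with a cheminformatics toolkit such as RDKit before comparing.

```python
from math import sqrt
from typing import Callable, Dict, Iterable


def balanced_parens(smiles: str) -> bool:
    """Toy validity check (assumption): branch parentheses must balance.

    A real pipeline would parse the string with a chemistry toolkit instead.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool] = balanced_parens,
) -> Dict[str, float]:
    """Return validity, uniqueness, and novelty as fractions in [0, 1].

    validity   = valid / generated
    uniqueness = distinct valid / valid
    novelty    = distinct valid not in the training set / distinct valid
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    samples = ["CCO", "CCO", "C(C", "c1ccccc1"]
    print(generation_metrics(samples, training_set=["CCO"]))
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Passing the validity check in as a parameter keeps the metric logic independent of any particular toolkit. Comparing raw strings, as done here, treats two different SMILES spellings of the same molecule as distinct, so canonicalizing first is the safer choice for real evaluations.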
