The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty, Prasanna Pendse
TL;DR
This work addresses the tokenization bottleneck that limits chemistry applications of large language models by unifying natural language and molecular representations in a single model. It introduces a data-driven vocabulary extension that adds 17,795 chemistry-relevant tokens and couples this with continued pretraining on a blended corpus of chemistry and general-domain data. Experiments with Llama3-8B show that vocabulary extension plus continued pretraining yields consistent gains on SMolInstruct tasks, indicating improved joint understanding of text and SMILES, though some numerical-property predictions remain challenging. The findings demonstrate that chemistry-aware tokenization and joint representation learning can produce more efficient, capable foundation models for chemical reasoning and instruction tasks, with potential for integration with external chemistry tools.
Abstract
The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension, which augments a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.
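To make the bottleneck concrete, the following minimal sketch tokenizes one SMILES string with and without a few chemistry-specific tokens. It is an illustration only, not the authors' pipeline: the greedy longest-match tokenizer stands in for the BPE tokenizer a model such as Llama3-8B actually uses, and `base_vocab` and `chem_tokens` are hypothetical, hand-picked vocabularies rather than the data-driven set of 17,795 tokens described above.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match-first tokenizer (a stand-in for BPE, not the real thing).

    At each position it takes the longest substring found in `vocab`,
    falling back to a single character when nothing longer matches.
    """
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical general-domain vocabulary: mostly single characters.
base_vocab = {"C", "c", "O", "N", "(", ")", "=", "1", "cc"}

# Hypothetical chemically salient tokens: a benzene ring, a carboxyl
# group, and a two-carbon chain.
chem_tokens = {"c1ccccc1", "C(=O)O", "CC"}
extended_vocab = base_vocab | chem_tokens

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin

base_tokens = greedy_tokenize(smiles, base_vocab)
extended_tokens = greedy_tokenize(smiles, extended_vocab)

print(len(base_tokens), base_tokens)
print(len(extended_tokens), extended_tokens)
```

With the base vocabulary the aromatic ring is split into pieces that carry no chemical meaning on their own, whereas the extended vocabulary keeps the ring and the carboxyl group as single tokens and shortens the sequence. In a real model, each added token would also need a new embedding row, which is the knowledge that continued pretraining is meant to integrate.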
