Bridging Molecular Graphs and Large Language Models

Runze Wang, Mingqi Yang, Yanming Shen

TL;DR

This work addresses the challenge of leveraging large language models for molecular graphs without fine-tuning the LLM backbone. It introduces Graph2Token, which learns a dedicated graph token through a graph tokenizer that aligns with the LLM’s token space via cross-attention, aided by a multi-source molecule-text dataset and IUPAC-name prompts. The approach achieves strong few-shot performance on both classification and regression tasks, often surpassing graph-only baselines and matching fully fine-tuned LLM methods in data-scarce regimes, while keeping training lightweight. The results suggest that preserving LLM semantics while attaching a graph-specific token enables robust generalization and practical applicability in biomolecular domains with limited labeled data.

Abstract

While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without fine-tuning the LLM backbone. To achieve this goal, we first construct a molecule-text paired dataset from multiple sources, including ChEBI and HMDB, to train a graph structure encoder, which reduces the distance between graph and text representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers, which are incorporated into the LLM prompts. By aligning molecular graphs as special tokens, we can activate the LLM's generalization ability for molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of the proposed Graph2Token.
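
The abstract says the graph encoder is trained to reduce the distance between paired graph and text representations, but it does not name the loss. The sketch below shows one common way to do this, a symmetric contrastive (InfoNCE-style) objective over a batch of pairs. The choice of loss, the function names, and the `temperature` value are all assumptions made for illustration, not the authors' stated method.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax_at(row, idx):
    """Log-probability of entry `idx` under a softmax over `row`."""
    top = max(row)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
    return row[idx] - log_sum_exp


def contrastive_loss(graph_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed objective, see lead-in).

    Row i of graph_embs and row i of text_embs describe the same molecule.
    The loss is low when each graph is most similar to its own text.
    """
    graphs = [normalize(v) for v in graph_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(graphs)
    # Cosine similarity between every graph and every text, scaled.
    sim = [
        [sum(a * b for a, b in zip(g, t)) / temperature for t in texts]
        for g in graphs
    ]
    graph_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    text_to_graph = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (graph_to_text + text_to_graph)


# Toy embeddings: matched pairs versus deliberately mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the matched pairs give a much smaller loss than the mismatched ones, which is the behavior an encoder trained on molecule-text pairs would be pushed toward.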

Paper Structure

This paper contains 24 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Different approaches to applying LLMs to molecules.
  • Figure 2: Overall training process of Graph2Token. Stage 1: Pre-training the molecular graph encoder based on the constructed molecular-text dataset. Stage 2: Training the graph tokenizer that can align a graph token to LLM tokens.
  • Figure 3: Illustration of Graph2Token’s architecture for aligning a graph token with the LLM vocabulary. Given an input molecular graph, the graph tokenizer first embeds it via the pre-trained graph encoder. The graph features then act as the query state: they are associated with the compressed LLM token embeddings, and useful information is retrieved according to the computed association. To activate the LLM's reasoning ability, the IUPAC name and domain tasks are incorporated into the prompt. Finally, the task head outputs predicted values for specific tasks. Graph2Token does not fine-tune the LLM backbone.
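
The alignment step in Figure 3 can be read as single-head cross-attention in which the graph feature is the query and the compressed LLM token embeddings supply keys and values. The following is a minimal sketch under that reading. The dimensions, the random projection matrices `w_q`, `w_k`, `w_v`, and the function `graph_token` are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def graph_token(graph_feat, token_embs, w_q, w_k, w_v):
    """Build one graph token as an attention-weighted mix of LLM embeddings.

    graph_feat: output of a (frozen, pre-trained) graph encoder.
    token_embs: compressed LLM token embeddings, one row per entry.
    """
    query = matvec(w_q, graph_feat)
    keys = [matvec(w_k, emb) for emb in token_embs]
    values = [matvec(w_v, emb) for emb in token_embs]
    scale = math.sqrt(len(query))
    # Association between the graph query and every compressed token.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Retrieve information according to the computed association.
    dim = len(values[0])
    token = [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]
    return token, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes chosen only to keep the example small.
GRAPH_DIM, LLM_DIM, NUM_COMPRESSED = 8, 16, 32
rng = random.Random(0)
graph_feat = [rng.gauss(0.0, 1.0) for _ in range(GRAPH_DIM)]
token_embs = random_matrix(NUM_COMPRESSED, LLM_DIM, rng)
w_q = random_matrix(LLM_DIM, GRAPH_DIM, rng)
w_k = random_matrix(LLM_DIM, LLM_DIM, rng)
w_v = random_matrix(LLM_DIM, LLM_DIM, rng)

token, weights = graph_token(graph_feat, token_embs, w_q, w_k, w_v)
```

The resulting `token` has the LLM embedding dimension, so it could be placed in the input sequence next to the embedded prompt text while the LLM weights stay frozen. Only the projections (and, per the caption, a task head) would be trained in such a setup.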