G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

Zhaoning Yu; Xiangyang Xu; Hongyang Gao

TL;DR

G2T-LLM presents a graph-to-tree text encoding that serializes molecular graphs into tree-structured formats such as JSON and XML, enabling large language models to generate valid and diverse molecules. The approach combines token constraints with supervised fine-tuning to guide LLM outputs toward chemically coherent structures, addressing the invalid outputs common in graph-based methods. Experiments on QM9 and ZINC250k show strong validity and high novelty, with results competitive with state-of-the-art baselines and ablations supporting the encoding choice, the fine-tuning, and the token-constraining strategy. Overall, the work demonstrates that aligning molecular representations with LLM training data and objectives can yield flexible, human-guided molecular design capabilities with practical potential in drug discovery and materials engineering.

Abstract

We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges such as the invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieves performance comparable to state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
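
The graph-to-tree idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a depth-first traversal over heavy atoms only, writes ring-closing bonds as references to an already-emitted atom ID, and uses hypothetical JSON keys (`atom_name`, `atom_id`, `bonds`, `bond_type`, `atom`, `atom_ref`) and function names (`graph_to_tree`, `tree_to_graph`) chosen purely for illustration.

```python
import json


def graph_to_tree(atoms, bonds, root=0):
    """Serialize a molecular graph into a nested dict (a tree).

    atoms: list of element symbols, indexed by atom ID.
    bonds: list of (i, j, bond_type) tuples.
    Ring-closing bonds are stored as references to an already-visited
    atom ID, which keeps the output a finite tree (illustrative choice).
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, bond_type in bonds:
        adjacency[i].append((j, bond_type))
        adjacency[j].append((i, bond_type))

    visited = set()
    emitted = set()  # bonds already written into the tree

    def visit(i):
        visited.add(i)
        node = {"atom_name": atoms[i], "atom_id": i, "bonds": []}
        for j, bond_type in adjacency[i]:
            key = frozenset((i, j))
            if key in emitted:
                continue
            emitted.add(key)
            if j in visited:
                # Ring closure: point back to an existing atom by ID.
                node["bonds"].append({"bond_type": bond_type, "atom_ref": j})
            else:
                node["bonds"].append({"bond_type": bond_type, "atom": visit(j)})
        return node

    return visit(root)


def tree_to_graph(tree):
    """Decode the nested dict back into (atoms, bonds)."""
    atoms, bonds = {}, []

    def walk(node):
        atoms[node["atom_id"]] = node["atom_name"]
        for bond in node["bonds"]:
            if "atom" in bond:
                child = bond["atom"]
                bonds.append((node["atom_id"], child["atom_id"], bond["bond_type"]))
                walk(child)
            else:
                bonds.append((node["atom_id"], bond["atom_ref"], bond["bond_type"]))

    walk(tree)
    return [atoms[i] for i in sorted(atoms)], bonds


# Cyclopropene heavy-atom skeleton: a three-carbon ring with one double bond.
ATOMS = ["C", "C", "C"]
BONDS = [(0, 1, "DOUBLE"), (1, 2, "SINGLE"), (2, 0, "SINGLE")]

tree = graph_to_tree(ATOMS, BONDS)
text = json.dumps(tree, indent=2)  # the text form an LLM would read or write
decoded_atoms, decoded_bonds = tree_to_graph(json.loads(text))
```

Writing ring closures as ID references is one way to keep a cyclic graph representable as a tree; the decoder only needs to collect every bond entry to recover the original connectivity.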

Paper Structure (18 sections, 4 figures, 7 tables, 2 algorithms)

This paper contains 18 sections, 4 figures, 7 tables, 2 algorithms.

Introduction
Related Work
G2T-LLM
Challenges and Motivations
Graph-to-Tree Text Encoding
Token Constraining for Valid Tree-Structure Generation
Supervised Fine-Tuning LLMs for Molecular Generation
Inference Process of G2T-LLM
Experiments
Experimental Setup
Experimental Results
Visualization Results of Generated Molecules
Ablation Study: Impact of Tree-Structured Text Encoding
Ablation Study: Impact of Supervised Fine-Tuning LLM
Ablation Study: Impact of Size of the Fine-Tuning Dataset
...and 3 more sections

Figures (4)

Figure 1: Illustration of the Graph-to-Tree Text Encoding process described in the Graph-to-Tree Text Encoding section and its accompanying algorithm. This figure shows how the molecular structure of cyclopropene is transformed into a hierarchical tree representation. Each atom and bond is mapped to nodes and edges in the tree, with unique identifiers assigned.
Figure 2: An illustration of the supervised fine-tuning process of G2T-LLM. The process begins by randomly selecting a starting component, exemplified by cyclopropene, which is encoded into a partial tree structure and passed as a prompt to the LLM. The LLM generates the remaining molecular structure, which is compared against the ground truth. A loss is computed and is used to fine-tune the model, iteratively improving its performance in generating valid molecular graphs.
Figure 3: An illustration of the inference process of G2T-LLM. The process starts by prompting the model with a random molecular component. The model, a fine-tuned LLM (SFT-LLM), generates new molecular structures while applying token constraints to ensure valid outputs. The output is a tree-structured text representing the molecule. It is then decoded back into a molecular graph corresponding to cyclopropene.
Figure 4: Visualization of the generated molecules with Tanimoto similarity scores based on Morgan fingerprints. The best results are highlighted in bold.
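
The Tanimoto similarity named in the Figure 4 caption is the ratio of shared to total set bits between two fingerprints. The sketch below assumes each fingerprint is already available as a set of on-bit indices; the bit sets shown are toy values, not real Morgan fingerprints, and the function name `tanimoto` is illustrative. In practice a cheminformatics toolkit such as RDKit would compute the fingerprints themselves.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A & B| / |A | B| over sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Two empty fingerprints: returning 0.0 is a convention assumed here.
        return 0.0
    return len(a & b) / len(union)


# Toy bit sets standing in for the fingerprints of two molecules.
generated = {3, 17, 42, 128}
reference = {3, 17, 99, 128}

similarity = tanimoto(generated, reference)  # 3 shared bits out of 5 total
```

A score of 1.0 means the two fingerprints set exactly the same bits, while 0.0 means they share none.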
