G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

Zhaoning Yu, Xiangyang Xu, Hongyang Gao

TL;DR

G2T-LLM presents a graph-to-tree text encoding that serializes molecular graphs into tree-structured formats such as JSON and XML, enabling large language models to generate valid and diverse molecules. The approach combines token constraints with supervised fine-tuning to guide LLM outputs toward chemically coherent structures, addressing the invalid outputs common in graph-based methods. Experiments on QM9 and ZINC250k show strong validity and high novelty, with results competitive with state-of-the-art baselines, and ablations support the encoding choice, the fine-tuning, and the token-constraining strategy. Overall, the work demonstrates that aligning molecular representations with LLM training data and objectives can yield flexible, human-guided molecular design capabilities with practical potential in drug discovery and materials engineering.

Abstract

We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges such as the invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieves performance comparable to state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
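
To make the idea of a graph-to-tree text encoding concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a depth-first traversal in which each atom becomes a nested JSON object, and a bond that closes a ring points back to an already-emitted atom by its identifier. The field names (`atom_id`, `atom_name`, `bonds`, `bond_type`, `ref_atom_id`) and the helpers `graph_to_tree` and `tree_to_graph` are hypothetical, chosen only for illustration. The example molecule is the heavy-atom skeleton of cyclopropene.

```python
import json


def graph_to_tree(atoms, bonds, root=0):
    """Encode a molecular graph as a nested dict via depth-first traversal.

    atoms: dict mapping atom id -> element symbol
    bonds: dict mapping frozenset({a, b}) -> bond type string
    (Hypothetical schema, not the paper's exact format.)
    """
    adjacency = {atom_id: [] for atom_id in atoms}
    for pair, bond_type in bonds.items():
        a, b = sorted(pair)
        adjacency[a].append((b, bond_type))
        adjacency[b].append((a, bond_type))
    for neighbors in adjacency.values():
        neighbors.sort()  # deterministic traversal order

    visited, used_bonds = set(), set()

    def visit(atom_id):
        visited.add(atom_id)
        node = {"atom_id": atom_id, "atom_name": atoms[atom_id], "bonds": []}
        for neighbor, bond_type in adjacency[atom_id]:
            key = frozenset((atom_id, neighbor))
            if key in used_bonds:
                continue  # each bond is written exactly once
            used_bonds.add(key)
            if neighbor in visited:
                # Ring closure: refer back to an atom that already exists.
                node["bonds"].append(
                    {"bond_type": bond_type, "ref_atom_id": neighbor})
            else:
                node["bonds"].append(
                    {"bond_type": bond_type, "atom": visit(neighbor)})
        return node

    return visit(root)


def tree_to_graph(tree):
    """Decode the nested dict back into (atoms, bonds)."""
    atoms, bonds = {}, {}

    def walk(node):
        atoms[node["atom_id"]] = node["atom_name"]
        for bond in node["bonds"]:
            if "atom" in bond:
                other = bond["atom"]["atom_id"]
                walk(bond["atom"])
            else:
                other = bond["ref_atom_id"]
            bonds[frozenset((node["atom_id"], other))] = bond["bond_type"]

    walk(tree)
    return atoms, bonds


# Heavy atoms of cyclopropene: a three-carbon ring with one double bond.
atoms = {0: "C", 1: "C", 2: "C"}
bonds = {
    frozenset((0, 1)): "DOUBLE",
    frozenset((1, 2)): "SINGLE",
    frozenset((0, 2)): "SINGLE",
}

tree = graph_to_tree(atoms, bonds)
text = json.dumps(tree, indent=2)  # the text an LLM would read or emit
print(text)
```

In this sketch the ring is broken at the bond between atoms 2 and 0, which appears as a `ref_atom_id` entry instead of a nested atom. Decoding the JSON with `tree_to_graph(json.loads(text))` recovers the original atoms and bonds, which is the property a generation pipeline relies on when turning model output back into a molecular graph.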

Paper Structure (18 sections, 4 figures, 7 tables, 2 algorithms)

Figures (4)

  • Figure 1: Illustration of the Graph-to-Tree Text Encoding process described in the paper's graph-to-tree encoding section and its accompanying algorithm. This figure shows how the molecular structure of cyclopropene is transformed into a hierarchical tree representation. Each atom and bond is mapped to nodes and edges in the tree, with unique identifiers assigned.
  • Figure 2: An illustration of the supervised fine-tuning process of G2T-LLM. The process begins by randomly selecting a starting component, exemplified by cyclopropene, which is encoded into a partial tree structure and passed as a prompt to the LLM. The LLM generates the remaining molecular structure, which is compared against the ground truth. A loss is computed and is used to fine-tune the model, iteratively improving its performance in generating valid molecular graphs.
  • Figure 3: An illustration of the inference process of G2T-LLM. The process starts by prompting the model with a random molecular component. The model, a fine-tuned LLM (SFT-LLM), generates new molecular structures while applying token constraints to ensure valid outputs. The output is a tree-structured text representing the molecule. It is then decoded back into a molecular graph corresponding to cyclopropene.
  • Figure 4: Visualization of the generated molecules with Tanimoto similarity scores based on Morgan fingerprints. The best results are highlighted in bold.
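
The Tanimoto similarity mentioned in the Figure 4 caption can be illustrated with a short sketch. For two fingerprints represented as sets of "on" bit indices, the score is the size of the intersection divided by the size of the union. The bit sets below are made-up placeholders, not real Morgan fingerprints; in practice those would be computed by a cheminformatics toolkit. The function name `tanimoto` and the choice to return 0.0 for two empty fingerprints are assumptions made for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Placeholder fingerprints (hypothetical bit indices, not real molecules).
generated = {1, 5, 9, 12}
reference = {1, 5, 9, 40}

# Three shared bits out of five distinct bits.
score = tanimoto(generated, reference)
print(f"Tanimoto similarity: {score:.2f}")
```

A score of 1.0 means the two bit sets are identical and 0.0 means they share no bits, so a higher value indicates that a generated molecule is structurally closer to its reference under the chosen fingerprint.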