G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models
Zhaoning Yu, Xiangyang Xu, Hongyang Gao
TL;DR
G2T-LLM presents a graph-to-tree text encoding that serializes molecular graphs into tree-structured formats such as JSON and XML, enabling large language models to generate valid and diverse molecules. The approach combines token constraints with supervised fine-tuning to guide LLM outputs toward chemically coherent structures, addressing the invalid outputs common in graph-based methods. Experiments on QM9 and ZINC250k show strong validity and high novelty, with results competitive with state-of-the-art baselines and ablation evidence supporting the encoding choice, the fine-tuning, and the token-constraining strategy. Overall, the work demonstrates that aligning molecular representations with LLM training data and objectives can yield flexible, human-guided molecular design capabilities with practical potential in drug discovery and materials engineering.
Abstract
We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges such as the invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieves performance comparable to state-of-the-art methods on several benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
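To make the graph-to-tree idea concrete, the following is a minimal sketch of how a molecular graph might be serialized into a JSON tree. It is not the authors' implementation: the abstract does not specify a schema, so the function name `graph_to_tree` and the field names `atom_id`, `atom_name`, `bonds`, `bond_type`, `atom`, and `ref_atom_id` are assumptions chosen for illustration. The sketch walks the graph depth-first from a root atom, nests each newly reached atom under the bond that reached it, and records a ring-closing bond as a reference to an already visited atom so that the output stays a tree.

```python
import json


def graph_to_tree(atoms, bonds, root=0):
    """Serialize a molecular graph into a nested dict (hypothetical schema).

    atoms: list of element symbols, indexed by atom id
    bonds: list of (i, j, bond_type) tuples describing undirected bonds
    """
    # Build an adjacency list from the undirected bond list.
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, bond_type in bonds:
        adjacency[i].append((j, bond_type))
        adjacency[j].append((i, bond_type))

    visited = set()     # atoms already placed in the tree
    used_bonds = set()  # bonds already emitted, so each appears exactly once

    def visit(atom_id):
        visited.add(atom_id)
        node = {"atom_id": atom_id, "atom_name": atoms[atom_id], "bonds": []}
        for neighbor, bond_type in adjacency[atom_id]:
            bond_key = frozenset((atom_id, neighbor))
            if bond_key in used_bonds:
                continue
            used_bonds.add(bond_key)
            if neighbor in visited:
                # Ring closure: point back to the visited atom by id
                # instead of nesting it again.
                node["bonds"].append(
                    {"bond_type": bond_type, "ref_atom_id": neighbor}
                )
            else:
                # Tree edge: nest the neighbor's subtree under this bond.
                node["bonds"].append(
                    {"bond_type": bond_type, "atom": visit(neighbor)}
                )
        return node

    return visit(root)


# Usage: cyclopropane, a three-carbon ring with one ring-closing bond.
cyclopropane = graph_to_tree(
    ["C", "C", "C"],
    [(0, 1, "SINGLE"), (1, 2, "SINGLE"), (2, 0, "SINGLE")],
)
print(json.dumps(cyclopropane, indent=2))
```

Under this assumed schema, every bond appears exactly once, either as a nested `atom` or as a `ref_atom_id` back-reference, so the original graph can be rebuilt from the text. A fine-tuned model would be trained to emit text of this shape, and a token-constraining step could restrict decoding to outputs that parse as valid JSON with known atom and bond types.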
