Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

Daniel Flam-Shepherd, Alán Aspuru-Guzik

TL;DR

The paper demonstrates that language models trained to predict the next token can generate three-dimensional molecular, crystalline, and protein-pocket structures directly from textual representations of XYZ, CIF, and PDB files. By converting these structures into token sequences and using tokenization schemes that include atom types and coordinates, the authors show that transformers can learn valid 3D distributions without specialized 3D or equivariant architectures. They employ data augmentation through rotations to mitigate the lack of invariances and compare against state-of-the-art 3D generative models and graph-/string-based approaches, achieving competitive results across molecules, crystals, and protein pockets. This work suggests language models are a powerful, architecture-agnostic tool for exploring chemical space in 3D, with potential for inverse design and structure-based discovery.
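The rotation augmentation mentioned above can be illustrated with a minimal sketch. It assumes each structure is held as a list of (element, (x, y, z)) pairs and that rotations are drawn uniformly at random; the function names, the number of rotated copies, and the sampling scheme are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix from a random unit quaternion (assumed scheme)."""
    # Normalizing four Gaussian samples yields a uniformly distributed unit quaternion.
    while True:
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
        if norm > 1e-12:
            break
    w, x, y, z = (c / norm for c in q)
    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate_atoms(atoms, rotation):
    """Apply one rotation to every (element, (x, y, z)) pair; elements are unchanged."""
    rotated = []
    for element, pos in atoms:
        new_pos = tuple(
            sum(rotation[i][j] * pos[j] for j in range(3)) for i in range(3)
        )
        rotated.append((element, new_pos))
    return rotated


def augment(atoms, n_copies, seed=0):
    """Return n_copies randomly rotated versions of the same structure."""
    rng = random.Random(seed)
    return [rotate_atoms(atoms, random_rotation_matrix(rng)) for _ in range(n_copies)]
```

Each rotated copy describes the same geometry with different coordinate values, so a model without built-in rotational invariance sees several textual forms of one structure during training.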

Abstract

Language models are powerful tools for molecular design. Currently, the dominant paradigm is to parse molecular graphs into linear string representations that can easily be trained on. This approach has been very successful; however, it is limited to chemical structures that can be completely represented by a graph, such as organic molecules, while materials and biomolecular structures like protein binding sites require a more complete representation that includes the relative positioning of their atoms in space. In this work, we show how language models, without any architecture modifications and trained using next-token prediction, can generate novel and valid structures in three dimensions from various substantially different distributions of chemical structures. In particular, we demonstrate that language models trained on sequences derived directly from chemical file formats like XYZ files, Crystallographic Information Files (CIFs), or Protein Data Bank files (PDBs) can generate molecules, crystals, and protein binding sites in three dimensions. Furthermore, despite being trained on chemical file sequences, language models still achieve performance comparable to state-of-the-art models that use graph and graph-derived string representations, as well as other domain-specific 3D generative models. In doing so, we demonstrate that it is not necessary to use simplified molecular representations to train chemical language models, and that they are powerful generative models capable of directly exploring chemical space in three dimensions for very different structures.
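As a rough illustration of how an XYZ file might be turned into a token sequence, the sketch below parses XYZ text and emits either character-level tokens or one token per element and per rounded coordinate. The helper names, the two-decimal rounding, and the exact token boundaries are assumptions made for illustration; the paper's actual vocabulary and preprocessing may differ.

```python
def parse_xyz(text):
    """Parse XYZ text: atom count, comment line, then 'element x y z' rows."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    for row in lines[2:2 + n_atoms]:
        element, x, y, z = row.split()[:4]
        atoms.append((element, (float(x), float(y), float(z))))
    return atoms


def character_tokens(atoms, decimals=2):
    """Character-level scheme (assumed form): every character is one token."""
    tokens = []
    for element, pos in atoms:
        line = element + " " + " ".join(f"{c:.{decimals}f}" for c in pos)
        tokens.extend(line)   # split the line into single characters
        tokens.append("\n")   # keep the line break as a separator token
    return tokens


def coordinate_tokens(atoms, decimals=2):
    """Coordinate-level scheme (assumed form): element and each rounded coordinate are tokens."""
    tokens = []
    for element, pos in atoms:
        tokens.append(element)
        tokens.extend(f"{c:.{decimals}f}" for c in pos)
    return tokens


xyz_text = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""
water_atoms = parse_xyz(xyz_text)
```

In this sketch, coordinate-level tokenization yields much shorter sequences (four tokens per atom) at the cost of a larger vocabulary, while character-level tokenization keeps the vocabulary tiny but lengthens every sequence.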

Paper Structure (11 sections, 4 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: A) The training datasets of structures that we benchmark language models on in this work. B) The overview of the training workflow -- chemical file formats are converted to sequences of tokens using either character or coordinate-level tokenization. The language model is trained to predict the next token in these sequences.
  • Figure 2: A histogram of root mean squared deviations in atomic positions between 10K molecules sampled from the language model and their corresponding conformers generated by rdkit. Six example molecules and geometries with various r.m.s.d. values are visualized explicitly and compared with their rdkit conformers.
  • Figure 3: A) Protein pockets are pre-processed by removing residues far from the center of the pocket-ligand complex. B) A comparison between the model and training data distributions of the interatomic distance between 10 random pocket atoms and their closest and furthest pocket atoms. Additionally, we show a box plot for the number of carbon, nitrogen, and oxygen atoms. C) Six examples from the training data and six sampled from the language model; the first three are plotted showing individual atoms, and the last three show the surface of the pocket.
  • Figure S1: a) Examples of training molecules in three dimensions. b) Samples of molecules generated by the language model.
  • Figure S2: Example molecules and geometries with various r.m.s.d. values are visualized explicitly and compared with their rdkit conformers.
  • ...and 3 more figures
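
The root mean squared deviation reported in Figure 2 can be sketched as follows. This summary does not state whether or how the generated and rdkit geometries are aligned before comparison, so the sketch assumes the two coordinate sets are already superimposed and list atoms in the same order; the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equally ordered coordinate lists.

    Assumption: both structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        # Squared Euclidean distance between matching atoms.
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(total / len(coords_a))
```

A value of zero means the two geometries coincide atom for atom; larger values indicate that the sampled geometry departs further from the reference conformer.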