Learning the Language of Protein Structure

Benoit Gaujac, Jérémie Donà, Liviu Copoiu, Timothy Atkinson, Thomas Pierrot, Thomas D. Barrett

TL;DR

This work tackles the challenge of applying sequence modeling to protein structure by learning a discrete tokenization of 3D backbone geometry with a vector-quantized autoencoder. It combines a graph-based encoder, Finite Scalar Quantization, and an AlphaFold-inspired decoder to reconstruct backbone coordinates with an RMSD of roughly $1$-$5$ Å, using codebooks of $K=4096$ to $64000$ tokens. A decoder-only GPT trained on the learned codes can generate novel, designable protein backbones, with designability and diversity competitive with dedicated diffusion models. The approach enables seamless multimodal integration and provides a scalable, interpretable discrete representation for structure generation and design, with open-source code to facilitate further research.

Abstract

Representation learning and de novo generation of proteins are pivotal computational biology tasks. Whilst natural language processing (NLP) techniques have proven highly effective for protein sequence modelling, structure modelling presents a complex challenge, primarily due to its continuous and three-dimensional nature. Motivated by this discrepancy, we introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. This method transforms the continuous, complex space of protein structures into a manageable, discrete format with a codebook ranging from 4096 to 64000 tokens, achieving high-fidelity reconstructions with backbone root mean square deviations (RMSD) of approximately 1-5 Å. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures. Our approach not only provides representations of protein structure, but also mitigates the challenges of disparate modal representations and sets a foundation for seamless, multi-modal integration, enhancing the capabilities of computational methods in protein design.
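
To make the tokenization step concrete, here is a minimal sketch of the idea behind Finite Scalar Quantization: each latent coordinate is bounded and rounded to a small number of levels, and the resulting bins are flattened into a single token id. The level counts `LEVELS` are a hypothetical choice made only because their product equals the largest codebook size quoted above (64000); the function names are illustrative and are not taken from the paper's code. Published FSQ implementations also handle even level counts with an offset and use a straight-through estimator during training, both of which are omitted here.

```python
import math

# Hypothetical per-dimension level counts (an assumption, not from the paper).
# Their product is the implicit codebook size: 8*8*8*5*5*5 = 64000.
LEVELS = (8, 8, 8, 5, 5, 5)


def fsq_quantize(latent, levels=LEVELS):
    """Bound each latent coordinate with tanh, then round it to an integer bin."""
    if len(latent) != len(levels):
        raise ValueError("latent and levels must have the same length")
    codes = []
    for value, n_levels in zip(latent, levels):
        unit = (math.tanh(value) + 1.0) / 2.0       # squashed into [0, 1]
        codes.append(round(unit * (n_levels - 1)))  # integer in [0, n_levels - 1]
    return codes


def codes_to_token(codes, levels=LEVELS):
    """Flatten per-dimension bins into one token id (mixed-radix encoding)."""
    token, base = 0, 1
    for code, n_levels in zip(codes, levels):
        token += code * base
        base *= n_levels
    return token


def token_to_codes(token, levels=LEVELS):
    """Invert codes_to_token: recover the per-dimension bins from a token id."""
    codes = []
    for n_levels in levels:
        codes.append(token % n_levels)
        token //= n_levels
    return codes


codebook_size = math.prod(LEVELS)
# An arbitrary latent vector, standing in for one encoder output.
token = codes_to_token(fsq_quantize([0.3, -1.2, 2.0, 0.0, -0.4, 1.1]))
```

Because the codebook is implicit in the level grid, there are no learned code vectors to look up: a token id and its quantized latent can be converted back and forth with integer arithmetic alone.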

Paper Structure (48 sections, 6 equations, 22 figures, 2 tables, 4 algorithms)

This paper contains 48 sections, 6 equations, 22 figures, 2 tables, 4 algorithms.

Figures (22)

  • Figure 1: Schematic overview of our approach. The protein structure is first encoded as a graph, from which a GNN extracts features. This embedding is then quantized before being fed to the decoder, which estimates the positions of all backbone atoms.
  • Figure 2: Evolution of the RMSD (left) and TM-score (right) distributions with the codebook size for a downsampling ratio of 1 on CASP-15 data.
  • Figure 3: Visualisation of the model reconstructions (blue) super-imposed with the original structures (green) for a downsampling factor of $r=2$ and $K=64000$ codes (fourth column of \ref{tab:experiment_results}). Each row shows a different structure and each column a different rotation angle. The length and reconstruction RMSD of each structure are given to the left of the leftmost column.
  • Figure 4: Visualisation of generated samples (green) super-imposed with their self-consistent ESM-predicted structures (blue).
  • Figure 5: Illustration of the local attention mechanism when using 2 neighbors for aggregation.
  • ...and 17 more figures
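
The figures above report reconstruction quality as RMSD and TM-score. As a rough illustration of what these two numbers measure, the sketch below computes both for a pair of coordinate sets that are assumed to be already superimposed and matched atom by atom. This is a simplification: reported RMSD values are normally computed after an optimal superposition, and the full TM-score searches over superpositions, neither of which is done here. The TM-score distance scale is the standard length-dependent $d_0$; the clamp at 0.5 Å is a safeguard for very short chains, and the function names and example coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired, already-superimposed atoms."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def tm_score(coords_a, coords_b):
    """TM-score for a fixed superposition (no search over alignments)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    n = len(coords_a)
    # Length-dependent distance scale, clamped so short chains stay well-defined.
    d0 = max(1.24 * max(n - 15, 0) ** (1.0 / 3.0) - 1.8, 0.5)
    total = sum(
        1.0 / (1.0 + (math.dist(a, b) / d0) ** 2)
        for a, b in zip(coords_a, coords_b)
    )
    return total / n


# Toy example: 100 points spaced 3.8 Å apart on a line, and a copy shifted by 3 Å.
reference = [(3.8 * i, 0.0, 0.0) for i in range(100)]
shifted = [(x + 3.0, y, z) for x, y, z in reference]

rmsd_value = rmsd(reference, shifted)
tm_value = tm_score(reference, shifted)
```

RMSD grows without bound as the structures diverge, whereas the TM-score stays in $(0, 1]$ and is less sensitive to a few badly placed residues, which is one reason the two metrics are often reported together.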