Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences

Niklas Schmidinger, Lisa Schneckenreiter, Philipp Seidl, Johannes Schimunek, Pieter-Jan Hoedt, Johannes Brandstetter, Andreas Mayr, Sohvi Luukkonen, Sepp Hochreiter, Günter Klambauer

TL;DR

This work assesses xLSTM's ability to model biological and chemical sequences and introduces Bio-xLSTM, a suite of architectural variants that can serve as proficient generative models for DNA, protein, and chemical sequences and learn rich representations for those modalities.

Abstract

Language models for biological and chemical sequences enable crucial applications such as drug discovery, protein engineering, and precision medicine. Currently, these language models are predominantly based on Transformer architectures. While Transformers have yielded impressive results, their quadratic runtime dependency on the sequence length complicates their use for long genomic sequences and in-context learning on proteins and chemical sequences. Recently, the recurrent xLSTM architecture has been shown to perform favorably compared to Transformers and modern state-space model (SSM) architectures in the natural language domain. Similar to SSMs, xLSTMs have a linear runtime dependency on the sequence length and allow for constant-memory decoding at inference time, which makes them prime candidates for modeling long-range dependencies in biological and chemical sequences. In this work, we tailor xLSTM towards these domains and propose a suite of architectural variants called Bio-xLSTM. Extensive experiments in three large domains, genomics, proteins, and chemistry, were performed to assess xLSTM's ability to model biological and chemical sequences. The results show that models based on Bio-xLSTM a) can serve as proficient generative models for DNA, protein, and chemical sequences, b) learn rich representations for those modalities, and c) can perform in-context learning for proteins and small molecules.

Paper Structure

This paper contains 40 sections, 11 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Overview of Bio-xLSTM. Top left: xLSTM for natural language processing tasks. Top right: Modeling approaches considered for biological sequences: masked language modeling, equivariance to the reverse-complement sequence, and in-context learning. Bottom left: DNA-xLSTM models are trained on genomic DNA sequences and then fine-tuned on downstream tasks. Bottom center: Prot-xLSTM models are trained in a causal modeling setting with a fill-in-the-middle objective and use homologous proteins for in-context learning. Bottom right: Chem-xLSTM models are trained to generate small molecules. In the in-context learning setting, Chem-xLSTM models use molecules with known properties.
  • Figure 2: Pre-training of 2M-parameter DNA models on the human reference genome (GRCh38). Models are trained at single-nucleotide resolution with a context length of 1024 bases. Left: Causal language modeling (CLM). Learning curves show next-token prediction (NTP) loss ($\downarrow$) on a test set, plotted against the number of tokens processed. Right: Masked language modeling (MLM). Learning curves show MLM loss ($\downarrow$) on a test set, plotted against the number of tokens seen, for various models. In both tasks, the xLSTM-based models consistently achieve the lowest loss values across all update steps.
  • Figure 3: Generative pre-training of protein language models. The learning curves show the validation loss of homology-aware protein language models during training. Left: Small models trained for 20B tokens with a context size of $2^{11}$ and fine-tuned for 10B tokens with a context size of $2^{17}$ tokens. Transformer++ can only be run for a small context size. Right: Prot-xLSTM-102M model trained with increasing context sizes from $2^{11}$ to $2^{18}$. Vertical gray dashed lines mark the points where the context size was increased. The arrow at 60B tokens indicates the model used for downstream tasks. The orange dashed line corresponds to the validation loss of ProtMamba-107M trained up to a context size of $2^{17}$ for a total of 195B tokens. Prot-xLSTM consistently outperforms the other models and sets a new state of the art in homology-aware generation.
  • Figure 4: Conditional generation of molecules via in-context learning (ICL) with 15M-parameter models. Left: Visualization of the molecular domains contained in the molecular domains dataset, shown as a t-SNE down-projection of molecules from different domains. Clusters on the exterior represent highly specific molecular domains. The validation and test sets contain molecules from highly specific, unseen molecular domains. Right: Generative training of chemistry language models on the molecular domains dataset. Learning curves show the mean CLM loss ($\downarrow$) on a validation set across training epochs. Shaded areas represent the standard deviation over runs. Chem-xLSTM achieves the lowest loss at conditional generation of molecules using ICL.
  • Figure A1: xLSTM and Bio-xLSTM blocks. Left: mLSTM block. LN (Layer Normalization) and GN (Group Normalization) refer to normalization modules, while L Skip represents learnable skip connections and Conv denotes causal 1D convolutions. The mLSTM block utilizes a gated pre-up-projection structure, akin to modern State-Space Models, with gates activated by the Swish function. Middle: sLSTM block. The sLSTM block features a GELU-gated post-up-projection structure, similar to Transformer architectures. Right: Bidirectional mLSTM block. For bidirectional processing, the xLSTM applies each block to the input sequence twice before combining the outputs: once left-to-right and once right-to-left.
  • ...and 3 more figures
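
The Figure 1 caption lists equivariance to the reverse-complement sequence among the modeling approaches for DNA. As a rough illustration of what that property means, the sketch below symmetrizes a per-position model by averaging its output on a sequence with its output on the opposite strand. This is only one generic way to obtain the property and is not claimed to be the mechanism used in Bio-xLSTM; `toy_model`, `rc_symmetrized`, and the base-to-value table are hypothetical names introduced for this example.

```python
from typing import Callable, List

# Illustrative names only; none of these come from the Bio-xLSTM code base.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_BASE_VALUE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0, "N": 0.0}

Model = Callable[[str], List[float]]


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand (uppercase A/C/G/T/N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def toy_model(seq: str) -> List[float]:
    """Hypothetical stand-in for a causal model: one score per position,
    depending only on the current base and its left neighbour."""
    scores = []
    prev = 0.0
    for base in seq:
        value = _BASE_VALUE[base]
        scores.append(value + 0.5 * prev)
        prev = value
    return scores


def rc_symmetrized(model: Model) -> Model:
    """Wrap a model so that wrapped(rc(x)) equals wrapped(x) reversed."""
    def wrapped(seq: str) -> List[float]:
        forward = model(seq)
        # Run on the opposite strand, then flip back to the input's orientation.
        backward = model(reverse_complement(seq))[::-1]
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    return wrapped
```

With `sym = rc_symmetrized(toy_model)`, the outputs satisfy `sym(reverse_complement(s)) == sym(s)[::-1]` for any sequence `s` over the supported alphabet, whereas the unwrapped `toy_model` generally does not. Averaging the two strands trades a second forward pass for the symmetry guarantee.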