DPLM-2: A Multimodal Diffusion Protein Language Model

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu

TL;DR

This work investigates the scalability of discrete diffusion language models by reprogramming pretrained masked LMs into diffusion LMs through diffusive adaptation and instruction tuning. It establishes a theoretical link between absorbing diffusion and masked language modeling (via reparameterized discrete diffusion) and demonstrates competitive performance against autoregressive baselines on multilingual translation and text summarization, especially with large-scale pretraining. The study also shows zero-shot and in-context learning capabilities emerge with instruction tuning, and observes promising but still limited reasoning abilities that improve with model size and data. Limitations include context-length constraints and arithmetic reasoning gaps, pointing to future work in pretraining diffusion LMs from scratch and enhancing reasoning and long-range generation.

Abstract

Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.
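
The abstract mentions a lookup-free quantization-based tokenizer without giving details. As a rough illustration of the general idea only (not the authors' implementation), the sketch below binarizes each dimension of a per-residue latent vector by its sign and reads the resulting bits as an integer token id, so no codebook lookup is needed. The function names, the 3-dimensional latents, and the example values are all hypothetical; a real tokenizer would obtain the latents from a learned structure encoder.

```python
from typing import List, Sequence


def lfq_codes(latent: Sequence[float]) -> List[float]:
    """Quantize each latent dimension to -1.0 or +1.0 by its sign."""
    return [1.0 if value > 0 else -1.0 for value in latent]


def lfq_token(latent: Sequence[float]) -> int:
    """Read the sign pattern of a latent vector as a binary token id.

    Dimension i contributes bit i, so a d-dimensional latent yields
    an implicit vocabulary of 2**d tokens without a stored codebook.
    """
    token = 0
    for i, value in enumerate(latent):
        bit = 1 if value > 0 else 0
        token |= bit << i
    return token


def tokenize_structure(latents: Sequence[Sequence[float]]) -> List[int]:
    """Map a sequence of per-residue latents to discrete structure tokens."""
    return [lfq_token(latent) for latent in latents]


# Hypothetical per-residue latents (assumed output of some structure encoder).
latents = [[0.3, -1.2, 0.8], [-0.5, -0.1, 2.0], [1.1, 0.4, 0.9]]
tokens = tokenize_structure(latents)
print(tokens)  # [5, 4, 7]
```

With 3 latent dimensions the implicit vocabulary has 2**3 = 8 tokens; the vocabulary size grows exponentially with the latent dimension, which is the main design appeal of sign-based quantization over a learned codebook.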

Paper Structure

This paper contains 25 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview. (A) Comparative illustration of language model (LM) paradigms, i.e., autoregressive LMs vs. diffusion LMs. (B) Overall illustration of the proposed approach wherein massively pretrained masked LMs are reprogrammed to diffusion LMs via generative surgery.
  • Figure 2: An exemplary generation process on machine translation. Notice that the target translation contains three segments, which are generated simultaneously by the diffusion language model.
  • Figure 3: Scaling curves of task-specific finetuning on IWSLT14, WMT14 and Gigaword-10K. We obtained the mT5 (xue2020mt5) results on IWSLT14 ourselves. The T5 results on WMT14 are from raffel2020T5. "OL": results obtained with oracle target lengths. "LB=10": length prediction results with 10 length beams. "#Params.": number of effective parameters (i.e., non-embedding parameters).
  • Figure 4: Zero-shot performance of Flan-XLM-R models. "OL" means the results are obtained with oracle length, while "LB" means the number of length beams to sample the target with length prediction. The model sizes refer to the number of non-embedding parameters.
  • Figure 5: Few-shot performance of Flan-XLM-R and Flan-T5 models. "OL" means the results are obtained with oracle length, while "LB" means the number of length beams to sample the target with length prediction. The model sizes refer to the number of non-embedding parameters.
  • ...and 1 more figure