Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion

Alex Morehead; Jeffrey Ruffolo; Aadyot Bhatnagar; Ali Madani

TL;DR

This work addresses the challenge of jointly designing sequences and 3D structures for nucleic acid–protein complexes, a setting where prior methods typically focus on proteins alone or on fixed backbones. It introduces MMDiff, a diffusion-based model that combines $SE(3)$-based structure denoising with discrete sequence diffusion, enabling co-design of proteins, nucleic acids, and their interactions. The approach uses a FrameDiff-inspired architecture, continuous representations for discrete sequences, and consensus sampling. It is validated on a new open benchmark, where it produces designable, diverse, and novel samples, including micro-RNA and ssDNA. The work highlights both the potential of macromolecular co-design and the limits imposed by current data, and it points to future directions such as larger datasets and end-to-end full-atom validation to advance practical macromolecular engineering.

Abstract

Generative models of macromolecules carry significant implications for industrial and biomedical efforts in protein engineering. However, existing methods are currently limited to modeling protein structures or sequences, independently or jointly, without regard to the interactions that commonly occur between proteins and other macromolecules. In this work, we introduce MMDiff, a generative model that jointly designs sequences and structures of nucleic acids and proteins, independently or in complex, using joint SE(3)-discrete diffusion noise. Such a model has important implications for emerging areas of macromolecular design, including structure-based transcription factor design and the design of noncoding RNA sequences. We demonstrate the utility of MMDiff through a rigorous new design benchmark for macromolecular complex generation that we introduce in this work. Our results demonstrate that MMDiff is able to successfully generate micro-RNA and single-stranded DNA molecules while being modestly capable of jointly modeling DNA and RNA molecules in interaction with multi-chain protein complexes. Source code: https://github.com/Profluent-Internships/MMDiff.

Paper Structure (12 sections, 6 equations, 11 figures, 1 table)

Introduction
Methodology
Preliminaries and Notation
Joint Continuous-Discrete Diffusion in $\mathbb{R}^{3}$
Generating macromolecular complexes with MMDiff
Experiments
Related Work
Discussions & Conclusions
Supplementary Material
Additional Designability Results
Alternative Noise Schedules for Sequence Generation
Dataset Distributions

Figures (11)

Figure 1: Our proposed Macromolecular Diffusion Model (MMDiff) jointly designs macromolecular sequences and structures. A. An overview of MMDiff. For each nucleic acid residue, a rigid-body frame centered at the $\mathrm{C4^{\prime}}$ atom is constructed. To build such a frame, the Gram-Schmidt algorithm is applied to a residue's $v_{1}$ and $v_{2}$ vectors, which places its $\mathrm{C3^{\prime}}$, $\mathrm{O4^{\prime}}$, and $\mathrm{C5^{\prime}}$ atoms relative to the position of the $\mathrm{C4^{\prime}}$ atom. All other residue atoms are placed autoregressively according to the corresponding torsion angles ($\Phi$) predicted by MMDiff. B. An illustrative example of how MMDiff generates realistic macromolecular samples. By iteratively denoising $N$ geometric frames and $N$ one-hot sequence vectors initialized from their respective reference distributions, MMDiff transitions an initially random sequence-structure pair at timestep $\mathrm{T_{F}}$ into a coherent macromolecule at timestep $0$, at which point each frame and its associated torsion angles are used to construct the position of each atom.
Figure 2: Comparison of $\mathrm{scRMSD}$ complex designability results using different training methods. Here, the top row corresponds to protein-only experiments, the middle row to nucleic acid-only experiments, and the bottom row to protein-nucleic acid experiments. The columns denote samples generated using the random macromolecule generation baseline, MMDiff, and MMDiff-{Protein, NA, Monomer} (corresponding to the {first, second, third} row), respectively. Note that novel data samples are displayed with a * symbol. Overall, most designable complexes contain 2-3 chains, and most generated complexes contain novel chains (with a novelty $> 0.7$).
Figure 3: Examples of macromolecules successfully designed by MMDiff.
Figure 4: Comparison of $\mathrm{scTM}$ complex designability results using different training methods. Here, the top row corresponds to protein-only experiments, the middle row to nucleic acid-only experiments, and the bottom row to protein-nucleic acid experiments. The columns denote samples generated using the random macromolecule generation baseline, MMDiff, and MMDiff-{Protein, NA, Monomer} (corresponding to the {first, second, third} row), respectively. Note that novel data samples are displayed with a * symbol. Overall, most designable complexes contain 2-3 chains, and most generated complexes contain novel chains (with a novelty $> 0.7$).
Figure 5: Comparison of $\mathrm{scRMSD}$ complex designability results for nucleic acid structures using different sequence noise schedules during sampling. Zoom in for the best viewing experience.
...and 6 more figures
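
The frame construction described in the Figure 1 caption can be illustrated with a minimal sketch. It shows a standard Gram-Schmidt step that turns two vectors $v_{1}$ and $v_{2}$ into a right-handed orthonormal frame with its origin at the $\mathrm{C4^{\prime}}$ atom. This is not taken from the MMDiff source code: the function name `gram_schmidt_frame`, the coordinates, and the choice of $v_{1}$ and $v_{2}$ as the $\mathrm{C4^{\prime}} \to \mathrm{C3^{\prime}}$ and $\mathrm{C4^{\prime}} \to \mathrm{O4^{\prime}}$ vectors are all assumptions made for illustration, since the caption does not state which atoms define the two vectors.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _scale(a, s):
    return [x * s for x in a]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def _normalize(a, eps=1e-8):
    norm = math.sqrt(_dot(a, a))
    if norm < eps:
        raise ValueError("cannot normalize a (near-)zero vector")
    return _scale(a, 1.0 / norm)


def gram_schmidt_frame(v1, v2, origin):
    """Build a rigid-body frame (rotation, translation) from two vectors.

    Hypothetical helper: returns a 3x3 rotation matrix whose columns are the
    orthonormal basis vectors e1, e2, e3, plus the frame origin.
    Raises ValueError if v1 and v2 are collinear.
    """
    e1 = _normalize(v1)
    # Remove the component of v2 that lies along e1, then normalize.
    e2 = _normalize(_sub(v2, _scale(e1, _dot(e1, v2))))
    # The cross product completes a right-handed basis.
    e3 = _cross(e1, e2)
    rotation = [[e1[i], e2[i], e3[i]] for i in range(3)]
    return rotation, list(origin)


# Made-up coordinates (in Angstroms) for one residue; not real data.
c4_prime = [0.0, 0.0, 0.0]
c3_prime = [1.5, 0.0, 0.0]
o4_prime = [0.5, 1.3, 0.2]

# Assumed definition of v1 and v2 for this sketch only.
v1 = _sub(c3_prime, c4_prime)
v2 = _sub(o4_prime, c4_prime)
rotation, translation = gram_schmidt_frame(v1, v2, c4_prime)
```

The returned rotation and translation together form one element of $SE(3)$, which is the kind of per-residue object the structure-denoising part of the model operates on.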
