Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification

Zicheng Liu, Siyuan Li, Zhiyuan Chen, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Xiaoming Zhang, Stan Z. Li

TL;DR

This work tackles the fragmentation of multi-omics modeling by grounding DNA, RNA, and proteins in a unified nucleotide representation via Life-Code. It introduces a codon-level tokenizer and bi-directional hybrid encoder with long-sequence efficient attention, based on central dogma mappings and knowledge distillation from protein LMs. Through extensive experiments across genomic benchmarks, RNA splicing, protein fitness, and ncRNA-protein interactions, Life-Code achieves state-of-the-art or competitive results. The approach advances multi-omics interpretation and cross-modality transfer while delivering efficient processing of long biological sequences, though limitations remain in non-coding region heterogeneity and post-translational modifications.

Abstract

The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the central dogma to redesign both the data and model pipelines and offers a comprehensive framework, Life-Code, that spans different biological functions. For the data flow, we propose a unified pipeline that integrates multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. For the model, we design a codon tokenizer and a hybrid long-sequence architecture that encode the interactions between coding and non-coding regions through masked modeling pre-training. To model the translation and folding process of coding sequences, Life-Code learns the protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. These designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics under the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
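The unification step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names `rna_to_dna` and `protein_to_dna` are hypothetical, reverse transcription is simplified to rewriting U as T on the same strand, and reverse translation picks one fixed codon per amino acid, which is an assumption because the genetic code is degenerate and the source does not say how codons are chosen.

```python
# Sketch only: maps RNA and protein sequences onto the DNA alphabet.
# ASSUMPTION: one fixed representative codon per amino acid (standard
# genetic code); the paper's actual codon choice is not given here.
REPRESENTATIVE_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
    "*": "TAA",  # stop
}


def rna_to_dna(rna: str) -> str:
    """Rewrite an RNA sequence in the DNA alphabet (U -> T).

    ASSUMPTION: same-strand rewrite, not the complementary cDNA strand.
    """
    return rna.upper().replace("U", "T")


def protein_to_dna(protein: str) -> str:
    """Reverse-translate amino acids into a nucleotide sequence."""
    return "".join(REPRESENTATIVE_CODON[aa] for aa in protein.upper())


if __name__ == "__main__":
    print(rna_to_dna("AUGGCU"))
    print(protein_to_dna("MKW"))
```

After this mapping, every input, whatever its original modality, is a string over A, C, G, and T, which is what lets a single nucleotide-level model consume all three omics.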

Paper Structure

This paper contains 54 sections, 9 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Overview of the central dogma data flow, together with the biological language models (LMs) that have been proposed separately for each modality.
  • Figure 2: Illustration of Life-Code Framework, which contains three pipelines. (a) Data Pre-processing for the unified input with the central dogma. (b) Pre-training of the Tokenizer and the hybrid Encoder to model the contextual information and the central dogma rules. (c) Transfer Learning to multi-omics downstream tasks with parameter-efficient Supervised Fine-tuning.
  • Figure 3: Data Sampling Pipeline. During pre-training, we sample DNA sequences from the DNA dataset and pairs of CDSs and amino acids from the DNA-AA pairing dataset. Since the DNA sequences sampled from RefSeq are much longer than the CDSs, we predefine a maximum sequence length (e.g., 8k) for each sample. For the CDSs and their corresponding amino acids, we apply the packing strategy of ModernBERT (Warner et al., 2024) to fill each sample with several CDSs.
  • Figure 4: Life-Code Tokenizer. To model the interactions among DNA, RNA, and amino acids, our tokenizer takes nucleotides (4 words) as inputs, uses codons as the latent vocabulary (64 words), and translates them to amino acids (20 words) as outputs. It can be pre-trained by (a) nucleotide masked modeling and (b) CDS-to-amino-acid translation.
  • Figure 5: Empirical Analysis of Life-Code Tokenizer. Left: Codon usage bias in the Life-Code Tokenizer across four representative species (E. coli, S. cerevisiae (yeast), D. melanogaster (fruit fly), and H. sapiens (human)), illustrating variations in codon frequency (%) for amino acids. The pastel strips highlight codons belonging to the same amino acid group. Right: t-SNE visualization of learned codon embeddings in the Life-Code Tokenizer, where codons that translate to the same or biochemically similar amino acids cluster together (e.g., hydrophobic or charged groups), and stop codons form a distinct region.
  • ...and 4 more figures
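
The three vocabulary levels named in the Figure 4 caption (4 nucleotides, 64 codons, 20 amino acids) can be made concrete with a small Python sketch. It uses a fixed lookup table for the standard genetic code and is only an illustration of the vocabulary sizes, not the learned Life-Code tokenizer; the names `CODON_TABLE`, `to_codons`, and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 64-entry codon vocabulary mapped onto amino acids ('*' marks a stop codon).
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def to_codons(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-mers.

    ASSUMPTION: trailing bases that do not fill a codon are dropped.
    """
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA alphabet) into amino acids."""
    return "".join(CODON_TABLE[codon] for codon in to_codons(cds.upper()))


if __name__ == "__main__":
    print(len(BASES), len(CODON_TABLE), len(set(AMINO) - {"*"}))
    print(translate("ATGGCTTGGTAA"))
```

Because several codons map to the same amino acid, the codon level retains information (such as the codon usage bias shown in Figure 5) that is lost once a sequence is written as amino acids.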