Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction

Zhao Yang; Yi Duan; Jiwei Zhu; Ying Ba; Chuan Cao; Bing Su

TL;DR

The paper proposes Prism, a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate confounding effects.

Abstract

Gene expression prediction, the task of predicting mRNA expression levels from DNA sequences, remains challenging. Previous work often focuses on extending the input sequence length to capture distal enhancers, which may influence target genes from hundreds of kilobases away. Our work first reveals that, for current models, long-sequence modeling can decrease performance, and that even carefully designed algorithms only mitigate the degradation caused by long sequences. Instead, we find that proximal multimodal epigenomic signals near target genes are more essential. Hence, we focus on how to better integrate these signals, a question that has been overlooked. We find that different signal types serve distinct biological roles: some directly mark active regulatory elements, while others reflect background chromatin patterns that may introduce confounding effects. Simple concatenation may lead models to develop spurious associations with these background patterns. To address this challenge, we propose Prism, a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate confounding effects. Our experimental results demonstrate that proper modeling of multimodal epigenomic signals achieves state-of-the-art gene expression prediction using only short sequences.
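The abstract relies on backdoor adjustment, which in its standard form estimates an interventional quantity as P(Y | do(T)) = sum over c of P(Y | T, c) P(c). The toy example below is a minimal sketch of that formula only. The binary variables, their roles (a confounder standing in for a background chromatin state, a treatment standing in for an observed signal) and all probabilities are hypothetical values chosen for illustration; they are not the paper's structural causal model or data.

```python
# Hypothetical toy distribution over binary variables (illustrative only).
# c: confounder, t: treatment, y: outcome.
P_C = {0: 0.5, 1: 0.5}                      # P(C = c)
P_T1_GIVEN_C = {0: 0.2, 1: 0.8}             # P(T = 1 | C = c)
P_Y1_GIVEN_TC = {                           # P(Y = 1 | T = t, C = c)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_t_given_c(t, c):
    """P(T = t | C = c) for binary t."""
    p1 = P_T1_GIVEN_C[c]
    return p1 if t == 1 else 1.0 - p1


def observational(t):
    """P(Y = 1 | T = t): strata are weighted by P(c | t), so C leaks in."""
    joint = {c: P_C[c] * p_t_given_c(t, c) for c in P_C}
    norm = sum(joint.values())
    return sum(P_Y1_GIVEN_TC[(t, c)] * joint[c] / norm for c in P_C)


def backdoor(t):
    """P(Y = 1 | do(T = t)) = sum_c P(Y = 1 | T = t, c) * P(c)."""
    return sum(P_Y1_GIVEN_TC[(t, c)] * P_C[c] for c in P_C)


# Effect of T on Y estimated with and without adjusting for C.
naive_effect = observational(1) - observational(0)
adjusted_effect = backdoor(1) - backdoor(0)
print(f"naive: {naive_effect:.2f}, adjusted: {adjusted_effect:.2f}")
```

In this toy setup the treatment raises the outcome probability by 0.2 within each confounder stratum. The adjusted estimate recovers that value, while the unadjusted contrast is inflated because the confounder is correlated with the treatment.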

Paper Structure

This paper contains 41 sections, 5 equations, 6 figures, 16 tables, 1 algorithm.

Introduction
Current methods do not benefit from long sequence input
Method
Problem Formulation
Structural Causal Model
Causal Intervention via Backdoor Adjustment
Functional Implementation
Training Objective
Experiments
Experimental Setup
Results of Gene Expression Prediction
Hyperparameter Sensitivity Analysis
Analysis of Learned Weights
Parameter Overhead
Related Work
...and 26 more sections

Figures (6)

Figure 1: (a) Long-range regulatory interactions through chromatin looping. (b) Current long-sequence models suffer from technical limitations. (c) Multimodal epigenomic signals provide cell-type-specific regulatory information. (d) Performance of Seq2Exp [su2025learning] and Caduceus [caduceus] with varying input sequence lengths. (e) Different signals show varying contributions. (f) Performance degradation when specific signals are removed during testing from a model trained with all signals.
Figure 2: Shortening input length at test time.
Figure 3: The SCM.
Figure 4: Architecture of Prism. Epigenomic signals $S$ are processed by two encoders: a signal encoder $g_{\theta}$ extracts high-dimensional epigenomic features $H$, while a confounder encoder $g_{\omega}$ learns $n$ distinct weights representing the confounder $C$. A final predictor $h_{\phi}$ uses these weighted features along with the DNA sequence $X$ to make a prediction.
Figure 5: Visualization of learned confounder weights ($a_1, a_2$) for three sampled genes.
...and 1 more figure
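The data flow described in the Figure 4 caption can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function name `prism_like_forward`, the single-layer linear encoders, the softmax normalization of the weights, the vector encoding of $X$, and the uniform average over the $n$ weighted combinations are all assumptions made here for illustration, since the caption does not specify them.

```python
import math
import random


def linear(x, w):
    """Multiply a vector x (length m) by a matrix w (m rows, k columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]


def init(rows, cols, rng):
    """Random matrix standing in for learned parameters (hypothetical)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def prism_like_forward(s, x, params, n):
    # Signal encoder (g_theta in the caption): signals S -> features H.
    h = [math.tanh(v) for v in linear(s, params["theta"])]
    d = len(h)
    # Confounder encoder (g_omega): signals S -> n weight vectors over H.
    flat = linear(s, params["omega"])
    weights = [softmax(flat[k * d:(k + 1) * d]) for k in range(n)]
    # Predictor (h_phi): score each weighted feature combination together
    # with X, then average over the n combinations (assumed uniform prior).
    preds = []
    for a in weights:
        combo = sum(ai * hi for ai, hi in zip(a, h))
        preds.append(linear(x + [combo], params["phi"])[0])
    return sum(preds) / n, weights


rng = random.Random(0)
num_signals, d, n, x_dim = 3, 4, 2, 5
params = {
    "theta": init(num_signals, d, rng),
    "omega": init(num_signals, n * d, rng),
    "phi": init(x_dim + 1, 1, rng),
}
s = [0.2, 0.9, 0.1]              # toy epigenomic signal values
x = [1.0, 0.0, 0.0, 1.0, 0.0]    # toy stand-in for an encoded DNA sequence
y_hat, weights = prism_like_forward(s, x, params, n)
```

Averaging the predictor output over the $n$ weight vectors mirrors the sum over confounder states in backdoor adjustment, under the assumption stated above that each state is equally likely.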
