Learning to Discover Regulatory Elements for Gene Expression Prediction
Xingyu Su, Haiyang Yu, Degui Zhi, Shuiwang Ji
TL;DR
Seq2Exp tackles gene expression prediction by learning active regulatory elements from long DNA sequences and epigenomic signals through a Beta-distributed mask, guided by a causal framework and the information bottleneck. The model decomposes the contributions of sequence and signals, samples a soft mask, and applies a predictor to the masked inputs, achieving state-of-the-art performance on CAGE-based expression prediction across two cell types. It also discovers regulatory elements more effectively than peak-based methods such as MACS3. The approach lays groundwork for scalable regulatory-element extraction, with potential extensions to more cell types and richer epigenomic data.
Abstract
We consider the problem of predicting gene expression from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expression. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract the regulatory elements that drive target gene expression, thereby improving the accuracy of gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences, and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and to apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers more influential regions than commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
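To make the masking idea concrete, the following is a minimal illustrative sketch and not the Seq2Exp implementation. It assumes per-position Beta parameters are already available (here they are hard-coded, hypothetical values), draws a soft mask from them, and feeds the masked one-hot sequence to a toy linear predictor that stands in for a real network. The function names `one_hot_dna`, `sample_soft_mask`, and `masked_prediction` are invented for this example, and the information bottleneck regularizer is omitted.

```python
import numpy as np


def one_hot_dna(seq):
    """Encode a DNA string as an (L, 4) one-hot array in A, C, G, T order."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, index[base]] = 1.0
    return out


def sample_soft_mask(alpha, beta, rng):
    """Draw one soft mask value per position from Beta(alpha_i, beta_i)."""
    return rng.beta(alpha, beta)


def masked_prediction(x, mask, weights):
    """Toy linear predictor on the masked input (a stand-in for a real model)."""
    masked = x * mask[:, None]  # scale every position by its mask value
    return float((masked * weights).sum())


rng = np.random.default_rng(0)
seq = "ACGTACGTAC"
x = one_hot_dna(seq)

# Hypothetical Beta parameters: positions 3..5 are favoured as "regulatory"
# (mass near 1), all other positions are pushed towards 0.
alpha = np.full(len(seq), 0.5)
beta = np.full(len(seq), 5.0)
alpha[3:6] = 5.0
beta[3:6] = 0.5

mask = sample_soft_mask(alpha, beta, rng)
weights = rng.normal(size=x.shape)  # random toy weights, assumption only
y_hat = masked_prediction(x, mask, weights)
```

In this sketch, positions whose Beta distribution concentrates near 1 pass through to the predictor almost unchanged, while positions concentrated near 0 are suppressed, which is the sense in which a soft mask selects candidate regulatory regions.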
