LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions

Muhammad Tahir, Shehroz S. Khan, James Davie, Soichiro Yamanaka, Ahmed Ashraf

TL;DR

This work shows that evaluating enhancer-promoter interaction (EPI) prediction models with random data splits overestimates performance, because enhancer-promoter pairs from the same genomic region end up in both the training and testing sets. It introduces leave-one-chromosome-out (LOCO) cross-validation as a fairer benchmarking paradigm and demonstrates a dramatic performance drop for a baseline CNN under LOCO, which points to overfitting under random splits. The authors propose a hybrid multi-branch network (MHybrid) that fuses sequence-based CNN features with k-mer (k = 5) features, achieving robust LOCO performance across six human cell lines and outperforming both the baseline CNN and SIMCNN. They also release the LOCOSplit dataset to standardize LOCO-based benchmarking of EPI prediction models.

Abstract

In mammalian and vertebrate genomes, the promoter region of a gene and its distal enhancers may be located millions of base pairs apart, and a promoter does not necessarily interact with its closest enhancer. Since base-pair proximity is not a good indicator of these interactions, considerable work has gone into developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly high accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets before model training. However, random splitting causes information leakage by assigning EP pairs from the same genomic region to both the training and testing sets, leading to overestimated performance. In this paper we propose a more rigorous training and testing paradigm for EPI prediction, namely Leave-one-chromosome-out (LOCO) cross-validation. We demonstrate that a deep learning algorithm that achieves high accuracy when trained and tested in the random-splitting setting drops drastically in performance under the LOCO setting, confirming the overestimation. We further propose a novel hybrid deep neural network for EPI prediction that fuses CNN-derived sequence features with k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating that it can learn more generalizable aspects of EP interactions. With this paper we also release the LOCO splitting-based EPI dataset. Research data is available in this public repository: https://github.com/malikmtahir/EPI
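
To make the difference from random splitting concrete, the following is a minimal sketch of a LOCO split, not the authors' released code. It assumes each EP pair carries a single chromosome label (both elements on the same chromosome); the `chrom` and `label` field names, the `loco_folds` helper, and the toy records are hypothetical.

```python
from collections import defaultdict


def loco_folds(pairs):
    """Yield (held_out_chrom, train_idx, test_idx), one fold per chromosome.

    `pairs` is a list of dicts with a hypothetical "chrom" key. Every pair
    from the held-out chromosome goes to the test set, so no genomic region
    can appear on both sides of the split.
    """
    by_chrom = defaultdict(list)
    for i, pair in enumerate(pairs):
        by_chrom[pair["chrom"]].append(i)

    for held_out in sorted(by_chrom):
        test_idx = list(by_chrom[held_out])
        train_idx = sorted(
            i
            for chrom, idxs in by_chrom.items()
            if chrom != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy EP pairs (illustrative only, not taken from the released dataset).
toy_pairs = [
    {"chrom": "chr1", "label": 1},
    {"chrom": "chr1", "label": 0},
    {"chrom": "chr2", "label": 1},
    {"chrom": "chrX", "label": 0},
    {"chrom": "chr2", "label": 0},
]

for held_out, train_idx, test_idx in loco_folds(toy_pairs):
    # The held-out chromosome never contributes a training example.
    train_chroms = {toy_pairs[i]["chrom"] for i in train_idx}
    assert held_out not in train_chroms
    print(held_out, "train:", train_idx, "test:", test_idx)
```

With chromosomes 1 to 22 plus X, the same loop would produce 23 folds. A random split offers no such guarantee, since two pairs from neighbouring loci on the same chromosome can land on opposite sides of it.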

Paper Structure

This paper contains 10 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: 1D CNN (referred to as $M_{CNN}$ in the rest of the paper) used to investigate whether genomic overlap between training and testing data leads to overestimated performance. $M_{CNN}$ also serves as one of the baselines against which the proposed model, $M_{Hybrid}$, is compared.
  • Figure 2: The proposed hybrid architecture, $M_{Hybrid}$, for predicting EPI. The enhancer and promoter sequences are each passed through a 1D CNN and through a k-mer feature extractor. The CNN and k-mer features from each branch are then concatenated to predict the probability of interaction for an EP pair.
  • Figure 3: Step-by-step illustration of the 23-fold LOCO cross-validation training and testing process.
  • Figure 4: Box plots showing the variation of AUCs across the 23 folds for different cell lines in the LOCOSplit setting. Within each of the four panels, the six box plots correspond to the six cell lines. (a) AUCs of $M_{CNN}$. (b) AUCs of $M_{Hybrid}$ (proposed model). (c) Difference in AUCs between $M_{Hybrid}$ and $M_{CNN}$ ($\Delta\mathrm{AUC} = \mathrm{AUROC}_{M_{Hybrid}} - \mathrm{AUROC}_{M_{CNN}}$). (d) p-values for the difference between the AUCs of $M_{Hybrid}$ and $M_{CNN}$.
  • Figure 5: Performance comparison between the proposed $M_{Hybrid}$ model and a widely used baseline model, SIMCNN (Zhuang et al., 2019), shown as box plots of the variation of AUCs across the 23 folds for different cell lines in the LOCOSplit setting. Within each of the three panels, the six box plots correspond to the six cell lines. (a) and (b) AUCs of the proposed $M_{Hybrid}$ model and of SIMCNN. $M_{Hybrid}$ performs better in terms of median, minimum, and maximum AUC over all 23 folds for every cell line. (c) p-values for the statistical significance of the AUC difference between $M_{Hybrid}$ and SIMCNN.
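
The k-mer feature extractor named in the Figure 2 caption can be illustrated with a minimal sketch. This is an assumption-laden example rather than the authors' implementation: it uses k = 5 as stated in the TL;DR, returns raw counts over the 4^5 = 1024 possible DNA 5-mers, and skips windows containing non-ACGT characters. The function name `kmer_counts`, the choice of raw counts over normalized frequencies, and the handling of ambiguous bases are all hypothetical.

```python
from itertools import product


def kmer_counts(seq, k=5):
    """Return a fixed-length vector of k-mer counts for a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    Windows containing any other character (for example "N") are skipped;
    this is an assumption, not something the source specifies.
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    seq = seq.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        slot = index.get(seq[start:start + k])
        if slot is not None:
            vec[slot] += 1
    return vec


# Hypothetical enhancer and promoter fragments, for illustration only.
enhancer_vec = kmer_counts("ACGTACGTAC")
promoter_vec = kmer_counts("AAAAANAAAAA")

# In a hybrid model, vectors like these would be concatenated with the
# CNN features of the same branch before the final prediction layers.
print(len(enhancer_vec), sum(enhancer_vec), sum(promoter_vec))
```

Because the vector length depends only on k, sequences of different lengths map to features of the same size, which makes them straightforward to concatenate with learned CNN features.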