Artificial Intelligence and Deep Learning Algorithms for Epigenetic Sequence Analysis: A Review for Epigeneticists and AI Experts

Muhammad Tahir, Mahboobeh Norouzi, Shehroz S. Khan, James R. Davie, Soichiro Yamanaka, Ahmed Ashraf

TL;DR

The paper surveys the intersection of artificial intelligence and deep learning with epigenetic sequence analysis, presenting a dual-perspective taxonomy that helps AI researchers identify tractable epigenetic problems and helps epigeneticists map those problems to suitable AI paradigms. It covers data types and public resources, reviews core DL architectures (CNNs, RNNs/LSTMs, autoencoders, transformers) and their applicability to epigenetic tasks, and systematically organizes literature by problem category: disease-marker prediction, gene expression, enhancer–promoter interactions, chromatin state discovery, and representation learning. The review highlights representative methods (e.g., DISMIR, DeepHistone, DeepChrome, SPEID, ChromeGCN, DeepC, ChromTransfer) and reports high performance in several domains, while also identifying pervasive challenges such as data imbalance and cross-dataset generalization. It offers concrete recommendations on data augmentation, contrastive learning, transfer learning, model interpretability, and wet-lab validation to advance robust, generalizable epigenetic AI solutions with potential clinical impact.

Abstract

Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper, we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold, as it is addressed to both AI experts and epigeneticists. For AI researchers, we provide a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, for each of the above problems, we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.

Paper Structure

This paper contains 15 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Graphical overview of the taxonomy of various problems in epigenetic sequence analysis covered in this review article. (a) Different types of epigenetic problems. (b) A tabular illustration of which problem corresponds to which format (F) and learning paradigm (LP), e.g., first row shows that 'Disease marker prediction and detection' (orange) corresponds to Format 1 (F1: Sequence-to-scalar) and Learning Paradigm 1 (LP1: Supervised). (c) Illustration of which neural network architecture corresponds to which F and LP, e.g., Autoencoders in the context of sequential inputs would correspond to F2: Sequence-to-sequence and LP2: Unsupervised.
  • Figure 2: Different learning paradigms for DL: supervised, unsupervised, and reinforcement learning.
  • Figure 3: An example CNN consisting of a one-hot-encoded input (DNA sequence), followed by convolutional, pooling, and fully connected layers and an output.
  • Figure 4: A representation of the RNN architecture and its functional components: X is the input layer, h is the hidden layer (h(t) and h(t-1) are the new and previous states), and O is the output layer. U, V, and W represent the model parameters.
  • Figure 5: Architecture of an autoencoder, consisting of an input layer with $n$ elements $X_{1}$ to $X_{n}$, an encoder, a bottleneck, a decoder, and an output layer with $n$ elements $Y_{1}$ to $Y_{n}$, which are reconstructed from the original input data.
  • ...and 1 more figures
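
The Figure 3 caption mentions a DNA sequence fed to a CNN through one-hot encoding. As a rough illustration of that input step only, the sketch below turns a sequence into one row of four 0/1 values per base. The function name `one_hot_encode`, the channel order A, C, G, T, and the all-zero treatment of unknown bases are assumptions made for this example and are not taken from the paper.

```python
# Minimal sketch of one-hot encoding for a DNA sequence.
# Assumption: channels are ordered A, C, G, T (not specified by the paper).
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a list with one 4-element 0/1 row per base.

    Bases outside the alphabet (for example N) become all-zero rows,
    which is one common convention among several.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


# Hypothetical usage: a 5-base sequence yields a 5 x 4 matrix.
encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A matrix of this shape (sequence length by 4) is what a first convolutional layer would typically scan; the layer sizes and training setup are outside the scope of this sketch.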