MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction

Yi He, Yina Cao, Jixiu Zhai, Di Wang, Junxiao Kong, Tianchi Lu

TL;DR

This work provides a powerful tool for methylation prediction and demonstrates how explainable deep learning can drive both methodological innovation and the generation of biological hypotheses.

Abstract

Accurate computational identification of DNA methylation is essential for understanding epigenetic regulation. Although deep learning excels in this binary classification task, its "black-box" nature impedes biological insight. We address this by introducing MEDNA-DFM, a high-performance model, alongside mechanism-inspired signal purification algorithms. Our investigation demonstrates that MEDNA-DFM effectively captures conserved methylation patterns, achieving robust discrimination across diverse species. Validation on external independent datasets confirms that the model's generalization is driven by conserved intrinsic motifs (e.g., GC content) rather than phylogenetic proximity. Furthermore, our algorithms extracted motifs with significantly higher reliability than those reported in prior studies. Finally, empirical evidence from a Drosophila 6mA case study prompted us to propose a "sequence-structure synergy" hypothesis, suggesting that the GAGG core motif and an upstream A-tract element function cooperatively. We further validated this hypothesis via in silico mutagenesis, confirming that the ablation of either or both elements significantly degrades the model's recognition capabilities. This work provides a powerful tool for methylation prediction and demonstrates how explainable deep learning can drive both methodological innovation and the generation of biological hypotheses.
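
The in silico mutagenesis test described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `ablate`, `mutagenesis_effects`, the toy sequence `SEQ`, the element positions, and `toy_score` (a stand-in for a trained classifier) are hypothetical and are not taken from the MEDNA-DFM code. The paper's actual ablation procedure may differ. The sketch replaces each element with mismatching bases and compares the model score for the intact sequence, each single ablation, and the double ablation.

```python
import random

BASES = "ACGT"

def ablate(seq, start, length, rng):
    """Overwrite seq[start:start+length] so every base differs from the original."""
    chars = list(seq)
    for i in range(start, min(start + length, len(seq))):
        chars[i] = rng.choice([b for b in BASES if b != seq[i]])
    return "".join(chars)

def mutagenesis_effects(score, seq, core_pos, core_len, tract_pos, tract_len, seed=0):
    """Score the intact sequence and three ablated variants.

    `score` is any callable mapping a DNA string to a number
    (hypothetical stand-in for a trained methylation classifier).
    """
    rng = random.Random(seed)
    no_core = ablate(seq, core_pos, core_len, rng)
    no_tract = ablate(seq, tract_pos, tract_len, rng)
    neither = ablate(no_core, tract_pos, tract_len, rng)
    return {
        "wild_type": score(seq),
        "core_ablated": score(no_core),
        "tract_ablated": score(no_tract),
        "both_ablated": score(neither),
    }

def toy_score(seq):
    """Toy scorer (assumption): half credit for each element that is present."""
    return (("GAGG" in seq) + ("AAAAA" in seq)) / 2

# Illustrative sequence: A-tract at positions 2-6, GAGG core at positions 10-13.
SEQ = "TTAAAAATTCGAGGATCT"
effects = mutagenesis_effects(toy_score, SEQ, core_pos=10, core_len=4,
                              tract_pos=2, tract_len=5)
```

With a real model, a drop in score for each single ablation and a larger drop for the double ablation would be the pattern expected if the two elements act together. The toy scorer only demonstrates the bookkeeping, not the biology.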

Paper Structure

This paper contains 33 sections, 17 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comprehensive performance evaluation and ablation analysis of MEDNA-DFM. (A, B) Comparison of predictive performance between MEDNA-DFM and representative state-of-the-art methods across 17 benchmark datasets. The line charts illustrate the AUC (A) and MCC (B) scores. (C, D) Impact of token granularity on feature extraction. The bar charts display AUC and MCC metrics for different k-mer tokenization strategies (3-, 4-, 5-, and 6-mer) utilized in the Dual-View DNABERT module. (E, F) Ablation study on the efficacy of the FiLM module. Boxplots summarize the distribution of AUC and MCC values across all datasets for four structural variants: A1 (Full Model with Global-to-Local FiLM modulation), A2 (Simple Fusion via concatenation), A3 (6-mer backbone only), and A4 (Reversed FiLM with Local-to-Global modulation). The red dashed line indicates the median performance of the proposed A1 model. (G, H) Analysis of model capacity regarding the MoE module. Boxplots compare the AUC and MCC distributions for the baseline model without MoE (Non) and variants equipped with increasing numbers of experts (MoE-1, MoE-2, MoE-4, MoE-8).
  • Figure 2: Impact of domain-adaptive fine-tuning and adversarial regularization on model robustness. (A, B, C) Performance comparison with and without fine-tuning. The scatter plots display the ACC (A), AUC (B), and MCC (C) scores across all datasets. The y-axis represents the MEDNA-DFM model utilizing domain-adaptive fine-tuning, while the x-axis represents the model without this phase. (D, E, F) Efficacy of adversarial training. Comparison of ACC (D), AUC (E), and MCC (F) between models trained with (y-axis) and without (x-axis) adversarial regularization. In all plots, each point represents one benchmark dataset. The purple dashed diagonal line ($y=x$) indicates the baseline of identical performance; points located above this line demonstrate that the applied strategy (fine-tuning or adversarial training) yields superior predictive metrics.
  • Figure 3: Visualization of internal representation dynamics and cross-species generalization capabilities. (A–D) Trajectory of feature space evolution via UMAP visualization. The scatter plots illustrate the distribution of samples at four key processing stages: (A) Raw Embeddings, (B) Post-DNABERT, (C) Post-FiLM, and (D) Post-MoE. Blue points represent positive samples, while yellow points represent negative samples. (E) Cross-dataset heatmap. The matrix depicts the pairwise evaluation where rows represent source datasets (training) and columns represent target datasets (testing). Color intensity reflects the relative transferability, calculated as the ACC on the target set normalized by the source-on-source baseline. The chemical structures at the bottom correspond to the three DNA methylation categories: 5-hydroxymethylcytosine (5hmC), N4-methylcytosine (4mC), and N6-methyladenine (6mA).
  • Figure 4: Mechanism of motif-driven generalization across evolutionary boundaries. (A–D) Comparative motif analysis using kpLogo illustrating the sequence preferences of the target and source datasets. (A) The external independent validation set Homo sapiens (5mC). (B–D) The source datasets used for training: (B) M. musculus (5hmC), (C) C. equisetifolia (4mC), (D) C. elegans (6mA). (E) AUC of the three source models directly applied to the external H. sapiens dataset. (F) Contrast between phylogenetic distance and predictive accuracy (ACC). The phylogenetic tree shows the evolutionary proximity of the species to H. sapiens, while the bar chart (right) reveals that the evolutionarily distant plant model (C. equisetifolia) significantly outperforms the more closely related nematode model (C. elegans).
  • Figure 5: Signal disentanglement and high-fidelity motif purification via MEDNA-DFM. (A) Comparison of sequence logo landscapes for the 6mA_D.melanogaster dataset. The traditional kpLogo analysis effectively represents a composite union of these two independent feature subsets. (B) Analytical workflow for model-guided motif discovery and validation. Sequence features extracted from CAD and CWGA methods, along with raw data, are processed through STREME. The resulting motifs are validated via TOMTOM to identify statistically significant matches to known transcription factor binding sites.
  • ...and 3 more figures
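
The relative transferability shown in the Figure 3E heatmap (ACC on the target set normalized by the source-on-source baseline) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `relative_transferability`, the nested-dictionary layout, and the dataset labels `species_A` and `species_B` are hypothetical, and the accuracy values are made-up placeholders, not results from the paper.

```python
def relative_transferability(acc):
    """Normalize a cross-dataset accuracy matrix row by row.

    `acc[source][target]` is the raw accuracy of a model trained on
    `source` and tested on `target`. Each row is divided by its own
    source-on-source accuracy, so the diagonal becomes 1.0.
    """
    rel = {}
    for source, row in acc.items():
        baseline = row[source]  # source-on-source accuracy
        if baseline <= 0:
            raise ValueError(f"non-positive baseline for {source!r}")
        rel[source] = {target: value / baseline for target, value in row.items()}
    return rel

# Placeholder accuracies (illustrative only, not from the paper).
raw_acc = {
    "species_A": {"species_A": 0.8, "species_B": 0.4},
    "species_B": {"species_A": 0.45, "species_B": 0.9},
}
rel = relative_transferability(raw_acc)
```

Normalizing by the diagonal puts sources with different in-domain accuracy on a common scale: a value near 1.0 means the model transfers about as well as it performs on its own dataset, and lower values indicate a loss under transfer.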