A Deep Learning Pipeline for Epilepsy Genomic Analysis Using GPT-2 XL and NVIDIA H100

Muhammad Omer Latif, Hayat Ullah, Muhammad Ali Shafique, Zhihua Dong

TL;DR

Epilepsy transcriptomics yields high-dimensional, sparse data that are hard to interpret quickly. The authors introduce a GPU-accelerated pipeline that fine-tunes GPT-2 XL on encoded transcriptomic data, combining token-based gene representations with classic dimensionality reduction and heatmap visualization. They report state-of-the-art predictive performance (AUC $=0.90$, F1 $=0.88$) on two epilepsy datasets and recover biologically meaningful signatures: GRIA1 upregulation, GRIA2 downregulation, dysregulation of the interneuron markers SST and PVALB, and FOSB induction. NVIDIA H100 GPUs accelerate training and inference, which the authors present as evidence that transformer-based transcriptomics is feasible in data-limited neurogenomics and can be extended to multimodal precision neuromedicine.
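For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUC and F1 are computed from binary labels and classifier scores. The labels, scores, function names, and the 0.5 decision threshold are illustrative assumptions; the source does not state how the authors computed their metrics or which threshold they used.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """F1 = harmonic mean of precision and recall at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = epileptic sample, 0 = control (not from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)
f1 = f1_score(labels, scores)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

AUC is threshold-free and measures ranking quality, whereas F1 depends on the chosen decision threshold, so the two numbers can diverge on small or imbalanced cohorts.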

Abstract

Epilepsy is a chronic neurological condition characterized by recurrent seizures, with an estimated 50 million people affected worldwide. Although advances in high-throughput sequencing have enabled broad transcriptomic profiling of brain tissue, interpreting these highly complex datasets remains a major challenge. To address this, we propose a new analysis pipeline that combines deep learning with GPU-accelerated computation to investigate gene expression patterns in epilepsy. Specifically, our approach employs GPT-2 XL, a transformer-based Large Language Model (LLM) with 1.5 billion parameters, for genomic sequence analysis on NVIDIA H100 Tensor Core GPUs based on the Hopper architecture. The method enables efficient preprocessing of RNA sequencing data, gene sequence encoding, and subsequent pattern identification. We conducted experiments on two epilepsy datasets, GEO accessions GSE264537 and GSE275235. The results reveal several significant transcriptomic changes, including reduced hippocampal astrogliosis after ketogenic diet treatment and a restored excitatory-inhibitory signaling balance in a zebrafish epilepsy model. Moreover, our results highlight the effectiveness of combining LLMs with advanced hardware acceleration for transcriptomic characterization of neurological diseases.

Paper Structure

This paper contains 17 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the workflow of the proposed methodology.
  • Figure 2: Heatmap of raw gene expression counts (top 50 variable genes) in mouse epilepsy dataset (GSE264537). Samples (columns) are grouped by experimental condition (e.g. WT vs. KO). Notable expression patterns include upregulation of glutamatergic genes (e.g. GRIA1) and downregulation of GABAergic markers in the disease group.
  • Figure 3: Heatmap of normalized expression for top 50 genes in zebrafish epilepsy dataset (GSE275235). Distinct gene clusters separate mutant and control groups, reflecting excitatory/inhibitory gene expression imbalance in epileptic phenotypes.
  • Figure 4: PCA (panel A) and t-SNE (panel B) projections of GPT-2 XL-derived gene expression embeddings for the zebrafish epilepsy model (GSE275235). Each sample is color-coded by condition (mutant vs. control). In panel A, the first two principal components explain over 65% of the total variance and produce two distinct, non-overlapping clusters that reflect the slc13a5-induced excitatory–inhibitory transcriptional shift. The t-SNE projection in panel B sharpens this separation into tight, phenotype-specific groups, indicating that the transformer embeddings preserve nuanced biological structure even in a highly compressed representation.
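The heatmaps in Figures 2 and 3 are restricted to the top 50 variable genes. The following is a minimal sketch of one common way to make such a selection, ranking genes by their variance across samples. The function name, the use of population variance, and the toy expression values are assumptions for illustration; the source does not specify the authors' exact selection procedure, and the values below are not taken from GSE264537 or GSE275235.

```python
from statistics import pvariance


def top_variable_genes(expression, k=50):
    """Return the k gene symbols with the highest variance across samples.

    expression: dict mapping gene symbol -> list of per-sample values.
    Genes with equal variance keep their original insertion order.
    """
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy matrix: two control samples followed by two disease samples.
toy_expression = {
    "GRIA1": [10.0, 11.0, 30.0, 32.0],    # strongly shifted between groups
    "GRIA2": [25.0, 24.0, 8.0, 9.0],      # shifted in the opposite direction
    "ACTB": [100.0, 101.0, 100.0, 99.0],  # nearly constant
    "SST": [5.0, 5.0, 5.0, 5.0],          # constant, zero variance
}

selected = top_variable_genes(toy_expression, k=2)
print(selected)
```

With real count data, variance is usually computed after normalization or log transformation, because raw-count variance is dominated by highly expressed genes.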