HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data

Maria Mahbub, Robert J. Klein, Myvizhi Esai Selvan, Rowena Yip, Claudia Henschke, Providencia Morales, Ian Goethert, Olivera Kotevska, Mayanka Chandra Shekar, Sean R. Wilkinson, Eileen McAllister, Samuel M. Aguayo, Zeynep H. Gümüş, Ioana Danciu, VA Million Veteran Program

TL;DR

HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data) is a new framework that applies explainable transformer-based deep learning to single nucleotide polymorphism (SNP) data from genome-wide association studies (GWAS) to predict lung cancer (LC) risk.

Abstract

Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and the findings of familial aggregation studies point to a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) for predicting LC risk. Unlike prior approaches, HEMERA directly processes raw genotype data without clinical covariates, introducing additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients attributes model predictions to specific SNPs, and these attributions align strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved an AUC (area under the receiver operating characteristic curve) above 99%. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.
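For readers unfamiliar with the headline metric, AUC can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The following is a minimal sketch of that pairwise definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the evaluation code or data used for HEMERA.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (case, control) pairs in which the case
    has the higher score, counting ties as half a win.

    labels: iterable of 1 (case) or 0 (control)
    scores: iterable of predicted risk scores, same length as labels
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy data (hypothetical): two cases, two controls.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

The quadratic loop is chosen for readability; for cohorts of the size described here, a rank-based implementation from a standard library would be the practical choice.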

Paper Structure

This paper contains 19 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: HEMERA: a human-explainable transformer model for estimating lung cancer risk using GWAS data
  • Figure 2: Matched distribution of age, sex, ancestry, and smoking status between cases and controls in the study cohort.
  • Figure 3: Ablation study with varying transformer depth and minor allele frequency (MAF) threshold. a, Effect of transformer depth on model performance, assessed by varying the number of encoder layers from 1 to 6 while keeping the number of attention heads fixed at 1. b, Effect of MAF threshold on model performance, assessed by varying the MAF thresholds 0.01, 0.05, 0.1, 0.2, 0.3, and 0.4.
  • Figure 4: Model performance across 5-fold cross-validation.
  • Figure 5: Manhattan-style plot of SNP attribution scores across the genome. Each point represents a single nucleotide polymorphism (SNP), with its genomic position on the x-axis and its average positive attribution score (with respect to lung cancer prediction) on the y-axis. Chromosomes are concatenated end-to-end along the x-axis and alternately colored for visual clarity. Only SNPs with positive attribution scores are shown, highlighting features that contribute positively to the model’s classification of lung cancer.
  • ...and 1 more figure
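The MAF-threshold ablation described in the Figure 3 caption presupposes a filter that drops variants whose minor allele is too rare. The sketch below shows one common way to compute MAF from 0/1/2 allele dosages and apply a threshold. It is written under stated assumptions: the dosage encoding, the treatment of `None` as a missing call, the `>=` comparison, the function names, and the SNP identifiers are all hypothetical and are not taken from the paper's pipeline.

```python
def minor_allele_frequency(dosages):
    """MAF for one SNP from per-individual alternate-allele counts (0, 1, or 2).

    Missing calls are given as None and excluded from the denominator.
    Returns 0.0 when no genotype is called.
    """
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    # Each diploid individual contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(genotypes, threshold):
    """Keep SNPs whose MAF is at least `threshold` (assumed inclusive).

    genotypes: dict mapping a SNP identifier to its list of dosages
    """
    return {
        snp: dosages
        for snp, dosages in genotypes.items()
        if minor_allele_frequency(dosages) >= threshold
    }


# Toy genotypes (hypothetical identifiers, four individuals each).
toy = {
    "snp_a": [0, 0, 0, 1],  # MAF = 1/8 = 0.125
    "snp_b": [0, 1, 2, 1],  # MAF = 4/8 = 0.5
    "snp_c": [2, 2, 2, 2],  # monomorphic, MAF = 0.0
}
kept_at_0_2 = filter_by_maf(toy, 0.2)
kept_at_0_1 = filter_by_maf(toy, 0.1)
```

Raising the threshold shrinks the input sequence seen by the model, which is the trade-off the ablation varies across 0.01 to 0.4.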