Incorporating LLM Embeddings for Variation Across the Human Genome

Hongqian Niu; Jordan Bryan; Jacob Williams; Hufeng Zhou; Haoyu Zhang; Xihao Li; Didong Li

Incorporating LLM Embeddings for Variation Across the Human Genome

Hongqian Niu, Jordan Bryan, Jacob Williams, Hufeng Zhou, Haoyu Zhang, Xihao Li, Didong Li

Abstract

Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus on gene-level information. We present one of the first systematic frameworks to generate genetic variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we construct functional text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3/MEGA variants, 90 million imputed UK Biobank (UKB) variants, and 9 billion all possible variants. Embeddings were produced using general purpose models including both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models. Baseline quality control experiments demonstrate high predictive accuracy for variant-level properties, validating the embeddings as structured representations of genomic variation. We further apply them to real-world embedding-augmented genetic risk predictions that demonstrate the performance of using LLM embeddings in polygenic risk score (PRS) style predictions over the UK Biobank cohort data. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.

Incorporating LLM Embeddings for Variation Across the Human Genome

Abstract

Incorporating LLM Embeddings for Variation Across the Human Genome

Abstract

Paper Structure

Table of Contents

Figures (13)