Incorporating LLM Embeddings for Variation Across the Human Genome

Hongqian Niu, Jordan Bryan, Jacob Williams, Hufeng Zhou, Haoyu Zhang, Xihao Li, Didong Li

Abstract

Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus on gene-level information. We present one of the first systematic frameworks for generating genetic variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we construct functional text descriptions for 8.9 billion possible variants and generate embeddings at three scales: 1.5 million HapMap3/MEGA variants, 90 million imputed UK Biobank (UKB) variants, and the full set of 9 billion possible variants. We produce the embeddings with general-purpose models, including OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B. Baseline quality control experiments demonstrate high predictive accuracy for variant-level properties, validating the embeddings as structured representations of genomic variation. We further apply them to embedding-augmented genetic risk prediction, demonstrating the use of LLM embeddings in polygenic risk score (PRS)-style predictions on UK Biobank cohort data. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.

Paper Structure

This paper contains 22 sections, 8 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Sample annotation for an SNV at position 148992859 on chromosome 5 with reference allele C and alternate allele A based on the GRCh38/hg38 genome build.
  • Figure 2: Histogram of token counts for derived annotations for the UKB 1.5M SNV list.
  • Figure 3: Examples of variant annotations at different lengths and supporting information.
  • Figure 4: Chromosome number prediction task using variant-level embeddings from a) OpenAI text-embedding-3-large with prediction accuracy of greater than 99%, and b) Qwen3-Embedding-0.6B with prediction accuracy of 88%.
  • Figure 5: Reference allele prediction with a) OpenAI text-embedding-3-large with prediction accuracy of 92% and b) Qwen3-Embedding-0.6B with prediction accuracy of 86%.
  • ...and 8 more figures
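
To make the idea of a per-variant "functional text description" concrete, the following is a minimal sketch of how a variant record might be rendered as text before being passed to an embedding model. It reuses the SNV from Figure 1 (chromosome 5, position 148992859, C to A, GRCh38/hg38). The `Variant` class, the `describe` function, the sentence template, and the free-form `annotations` dictionary are all hypothetical names introduced for illustration; the paper's actual annotation fields and wording are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical container for one variant; not the paper's schema."""
    chromosome: str
    position: int
    ref: str
    alt: str
    build: str = "GRCh38/hg38"
    # Free-form annotation name -> value pairs (assumed structure).
    annotations: dict = field(default_factory=dict)


def describe(variant: Variant) -> str:
    """Render a variant as one plain-text description (assumed template)."""
    # Single-base ref and alt means a single-nucleotide variant (SNV).
    is_snv = len(variant.ref) == 1 and len(variant.alt) == 1
    kind = "SNV" if is_snv else "indel"
    parts = [
        f"{kind} at position {variant.position} "
        f"on chromosome {variant.chromosome} "
        f"with reference allele {variant.ref} "
        f"and alternate allele {variant.alt} ({variant.build})."
    ]
    # Append annotations in sorted key order so the text is deterministic.
    for key in sorted(variant.annotations):
        parts.append(f"{key}: {variant.annotations[key]}.")
    return " ".join(parts)


# The variant from Figure 1, with no annotations attached.
figure1_variant = Variant("5", 148992859, "C", "A")
figure1_text = describe(figure1_variant)
print(figure1_text)
```

The resulting string is what would be sent to an embedding model such as text-embedding-3-large or Qwen3-Embedding-0.6B. Sorting the annotation keys is a design choice made here so that the same variant always yields the same text, and therefore the same embedding.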