GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

Zehui Li; Vallijah Subasri; Guy-Bart Stan; Yiren Zhao; Bo Wang

TL;DR

GV-Rep tackles the challenge of interpreting rapidly expanding genetic variant data by delivering a large-scale, context-rich GV dataset for representation learning. The authors assemble over 7.5 million GV records from multiple public sources and ClinVar, standardize inputs as (ref, alt, annotation) triplets, and annotate them with diverse labels to train and evaluate genomic foundation models. Through experiments with four GFMs and a fine-tuning regime, they reveal a substantial gap between current model capabilities and robust GV representations, while also demonstrating improved GV indexing when using fine-tuned encoders. The dataset and accompanying analyses provide a practical framework for learning GV representations and benchmarking GV retrieval across complex, context-dependent genomic scenarios, with implications for improved clinical prioritization and variant interpretation.
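
The (ref, alt, annotation) triplet format can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the GV-Rep code: the `VariantTriplet` type, the `make_triplet` helper, the 0-based position convention, the symmetric `flank` parameter, and the toy sequence and annotation string.

```python
from typing import NamedTuple


class VariantTriplet(NamedTuple):
    """Hypothetical container for one variant record."""
    ref: str         # reference sequence with flanking context
    alt: str         # same context with the alternate allele substituted
    annotation: str  # free-text context, e.g. tissue or trait


def make_triplet(genome: str, pos: int, ref_allele: str, alt_allele: str,
                 flank: int, annotation: str) -> VariantTriplet:
    """Build a triplet around a variant.

    `pos` is the 0-based index of the first base of `ref_allele` in `genome`
    (an assumed convention for this sketch).
    """
    if genome[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the genome at pos")
    # Take up to `flank` bases on each side, clipped to the sequence bounds.
    start = max(0, pos - flank)
    end = min(len(genome), pos + len(ref_allele) + flank)
    left = genome[start:pos]
    right = genome[pos + len(ref_allele):end]
    return VariantTriplet(left + ref_allele + right,
                          left + alt_allele + right,
                          annotation)


# Toy example: an A>G substitution at index 4 with 3 bases of context per side.
triplet = make_triplet("ACGTACGTAC", 4, "A", "G", 3, "tissue=lung")
```

Varying `flank` is one way to produce the variable-length contexts the dataset is described as offering.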

Abstract

Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next-generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experiments on the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.

Paper Structure (80 sections, 7 figures, 4 tables)

Introduction
Preliminaries and Related Work
Dataset
Dataset Overview
Dataset Construction
Dataset Description
Dataset Statistics and Analysis
Statistics of Clinically Verified GVs
Experiment with Genomic Foundation Models
Experiment Setup
Variant Property Prediction
Scaling Law in GV Prediction
Fine-Grained and Coarse-Grained Tasks
Genetic Variants Indexing
Comparison of Indexing Accuracy between Original and Finetuned GFMs
...and 65 more sections

Figures (7)

Figure 1: Overview of the proposed dataset pipeline. The input includes clinician-verified genetic variants from multiple sources like ClinVar and GTEx. These are processed through data cleaning, sequence extraction, and unified formatting. The resulting data is used in genomic foundation models for various tasks such as prediction and indexing.
Figure 2: Dataset Construction and Usage. This diagram gives an example of the construction workflow of the GV-Rep dataset from a source database. Genetic variant records, containing chromosome position and reference/alternate alleles, along with biospecimen-specific annotations and a binary label indicating the significance of the GV, are extracted from the source GV database. The sequence extractor processes these GV records, which can then be used by GFMs for predicting the significance of unknown genetic variants. The finetuned GFMs could encode and index unknown GVs by matching them with GVs in the databases.
Figure 3: Distributions of Genetic Variants by Chromosome. The distribution of GVs is relatively uniform across chromosomes.
Figure 4: (a) Distributions of Disease and Trait Labels. (b) Gene-KO Fitness Score Distributions.
Figure 5: Scaling Law of Genomic Foundation Models in ClinVar Lung Disease Classification. The plot shows the accuracy of various models (HyenaDNA, DNABERT2, NT, and NT_v2) vs. sequence length. The context length extends on both sides of the mutated nucleotides of genetic variants.
...and 2 more figures
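
The indexing step described in the Figure 2 caption, matching an unknown GV against GVs in a database, can be sketched as nearest-neighbor retrieval over embeddings. This is an illustration under stated assumptions, not the paper's method: the cosine similarity metric, the `nearest_variant` helper, the variant names, and the three-dimensional toy vectors are all hypothetical stand-ins for embeddings a GFM encoder would produce.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_variant(query: List[float], index: Dict[str, List[float]]) -> str:
    """Return the key of the indexed embedding most similar to the query."""
    return max(index, key=lambda name: cosine(query, index[name]))


# Toy index: in practice each vector would come from an encoder, and the
# keys would identify clinically verified variants (names here are made up).
index = {
    "variant_a": [1.0, 0.0, 0.0],
    "variant_b": [0.0, 1.0, 0.0],
}
best = nearest_variant([0.9, 0.1, 0.0], index)
```

A linear scan like this is only practical for small indexes; a dataset with millions of records would call for an approximate nearest-neighbor structure.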
