AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting

Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari

TL;DR

A generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles that establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.

Abstract

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging because the sequence space is vast and multiple functional properties must be optimized simultaneously. The challenge is compounded in the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, or thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validation showed that the majority of the generated variants have superior predicted performance across all three indices, indicating successful multi-objective optimization. Furthermore, structural analysis with AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.
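
The composite reward described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the predictor callables are hypothetical stand-ins for the three ESM-2-based regressors, the tiered margin rule is only loosely inspired by the Figure 3 caption (reward defined relative to the wild type using each predictor's mean absolute error), the unweighted sum is one of several plausible ways to combine objectives, and the group-wise normalization is a generic group-relative advantage of the kind used by group-based policy optimization methods.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Hypothetical type: a predictor maps a protein sequence to a scalar score.
Predictor = Callable[[str], float]


def margin_reward(pred: float, wt_score: float, mae: float) -> float:
    """Assumed rule: +1 if the prediction beats the wild type by more than
    the predictor's MAE, -1 if it is worse by more than the MAE, else 0."""
    if pred > wt_score + mae:
        return 1.0
    if pred < wt_score - mae:
        return -1.0
    return 0.0


def composite_reward(
    seq: str,
    predictors: Dict[str, Predictor],
    wt_scores: Dict[str, float],
    maes: Dict[str, float],
) -> float:
    """Unweighted sum of per-objective margin rewards (an assumption)."""
    return sum(
        margin_reward(predictors[name](seq), wt_scores[name], maes[name])
        for name in predictors
    )


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of sampled sequences."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # no signal if all rewards are equal
    return [(r - mu) / sigma for r in rewards]


# Toy predictors (hypothetical): residue fractions stand in for real models.
toy_predictors: Dict[str, Predictor] = {
    "production_fitness": lambda s: s.count("A") / len(s),
    "kidney_tropism": lambda s: s.count("K") / len(s),
    "thermostability": lambda s: s.count("G") / len(s),
}
toy_wt = {name: 0.1 for name in toy_predictors}
toy_mae = {name: 0.05 for name in toy_predictors}

group = ["AAKKGGAAKK", "AKGTTTTTTT", "TTTTTTTTTT"]
rewards = [composite_reward(s, toy_predictors, toy_wt, toy_mae) for s in group]
advantages = group_advantages(rewards)
```

In this toy setup the first sequence exceeds the reference on all three objectives, the second falls inside every margin, and the third falls below every margin, so the normalized advantages rank them in that order.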

Paper Structure (35 sections, 8 equations, 6 figures)

Figures (6)

  • Figure 1: Development and assessment workflow of AAVGen. The left upper panel illustrates the dataset curation process. The right upper panel details the model training phase, which includes supervised fine-tuning (SFT), custom reward modeling, and group sequence policy optimization (GSPO). The lower panel outlines the assessment procedure, comprising generation analysis, evaluation of production fitness, kidney tropism, thermostability, and structural analysis.
  • Figure 2: Evaluation of regression models. (A) Training loss progression for models predicting production fitness, kidney tropism, and thermostability. (B) Correlation between experimentally determined (true) and model-predicted scores for production fitness, kidney tropism, and thermostability.
  • Figure 3: Reward progression during training with group sequence policy optimization (GSPO). (A) Total reward over the course of training. (B) Reward progression during training of AAVGen for the individual reward functions scoring production fitness, kidney tropism, and thermostability. (C) Reward assignment logic for each objective, defined by the mean absolute error of model predictions on the validation set relative to the wild-type (WT) reference for production fitness, kidney tropism, and thermostability.
  • Figure 4: Sequence diversity and alignment metrics of the AAVGen-generated library. (A) Performance benchmarks of the generative model regarding sequence novelty. Left: Cumulative repetitiveness of generated sequences across increasing subset sizes ($N = 1{,}000$) for a library of 500,000 variants. Right: Distribution of sequence lengths for the training set (blue) and generated sequences (orange) relative to the wild-type (WT) AAV2 VP1 sequence (dashed line). (B) Alignment-based divergence of generated variants from the AAV2 WT reference. Left: Frequency distribution of edit distances from the WT sequence. The majority of synthetic variants contain between 10 and 15 mutations compared to the WT template. Right: Correlation between sequence similarity and identity percentages. Data points are colored by alignment score (ranging from 1425 to $>1460$), illustrating that the generated sequences maintain high conservation ($>98.5\%$ identity) while exploring a diverse landscape of substitution patterns.
  • Figure 5: Functional property analysis of generated AAVGen sequences. (A) Qualitative classification of generated sequences into “Best”, “Good”, “Uncertain”, and “Bad” categories based on predicted production fitness, kidney tropism, and thermostability scores. (B) Pairwise correlation analyses between predicted production fitness, kidney tropism, and thermostability scores. (C) Joint three-dimensional distribution of predicted production fitness, kidney tropism, and thermostability scores, with each point colored according to the average of these three scores.
  • ...and 1 more figure
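
The edit-distance and identity metrics named in the Figure 4 caption can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes the edit distance is a standard Levenshtein distance, it computes identity only for equal-length, gap-free pairs (a real analysis would use a pairwise aligner), and the sequences are short toy strings, not real capsid sequences.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion from a
                    curr[j - 1] + 1,     # insertion into a
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions; assumes pre-aligned, equal-length
    sequences without gaps (a simplification)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy reference and variant (hypothetical strings for illustration only).
reference = "MAADGYLPDW"
variant = "MAADGYLPEW"  # one substitution relative to the reference

distance = edit_distance(reference, variant)
identity = percent_identity(reference, variant)
```

For substitution-only variants of equal length, the edit distance equals the number of mismatched positions, so the two metrics carry the same information; they diverge once insertions or deletions change the sequence length.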