AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation

Lijun Liu, Jiali Yang, Jianfei Song, Xinglin Yang, Lele Niu, Zeqi Cai, Hui Shi, Tingjun Hou, Chang-yu Hsieh, Weiran Shen, Yafeng Deng

TL;DR

The paper addresses the challenge of efficiently discovering viable AAV capsid variants within an enormous sequence space. It introduces an end-to-end discrete diffusion model trained on public AAV2 data to generate large libraries of VP sequences, then validates viability experimentally on AAV2 region VIII and AAV9 region VIII/IV/V mutations, as well as cross-serotype transfer. Key findings include high viability rates for multi-mutant AAV2 designs and meaningful viability after transferring sequences to AAV9, demonstrating improved design efficiency and cross-serotype applicability. This approach promises to accelerate capsid engineering for targeted, efficient gene delivery with potential clinical impact in tissue-specific therapies.

Abstract

Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested to the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. We transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions IV and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.

Paper Structure (13 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: a: Distribution of features of sequences generated by the diffusion model compared to the training set. The left graph represents the feature distribution after dimensionality reduction using t-SNE, while the right graph represents the feature distribution after dimensionality reduction using PCA. Class 1 represents the generated sequences, while classes 2 and 3 represent sequences from the training set. b: Distribution of sequence lengths for sequences generated by the diffusion model compared to the training set. The x-axis represents the length of the sequences, and the y-axis represents the frequency. The green color represents the generated sequences, while the other two colors represent sequences from the training set. c: Distribution of the number of mutation sites for sequences generated by the diffusion model compared to the training set. The x-axis represents the number of mutation sites, and the y-axis represents the frequency. The green color represents the generated sequences, while the other two colors represent sequences from the training set. d: Distribution of different lengths of consecutive insertions generated by the diffusion model. The x-axis represents the length of consecutive insertions, and the y-axis represents the proportion of samples. e: Distribution of the number of clusters for sequences generated by the diffusion model and the CNN model. The x-axis represents the clustering radius (sequences with a difference in mutation count within this radius are considered in the same cluster), and the y-axis represents the number of clusters. f: Proportion of viable samples for sequences generated by the diffusion model at different numbers of mutation sites. The x-axis represents different numbers of mutation sites, and the y-axis represents the proportion of viable samples.
  • Figure 2: a: Proportion of samples generated by the diffusion model at different numbers of mutation sites, where blue represents AAV2 and red represents AAV9. b: Proportion of viable samples for sequences generated by the diffusion model at different numbers of mutation sites. The x-axis represents different numbers of mutation sites, and the y-axis represents the proportion of viable samples.
  • Figure 3: a: Distribution of activity values for single-mutant sequences. The x-axis represents the activity values of the sequences, and the y-axis represents the frequency of sequences within that activity value range. b: Trend of activity values for single-mutant sequences. The x-axis represents each sequence, and the y-axis represents the normalized activity values. c: Enrichment levels of single-mutant sequences at different mutation positions. From top to bottom are the enrichment levels of sequences with single mutations in regions VIII, IV, and V of AAV9. The x-axis represents the mutation positions in the current region. A non-integer mutation position indicates an insertion between the two adjacent integer positions. The y-axis represents the amino acid type after the mutation, where "-" represents the deletion of the amino acid at the current position. Black dots mark the wild-type amino acid at each position in that region. d: Enrichment levels of single-mutant sequences. From top to bottom are the enrichment levels of sequences with single mutations in regions VIII, IV, and V of AAV9. The x-axis represents the frequency of plasmids after sequencing, and the y-axis represents the frequency of viruses after sequencing.
  • Figure 4: The directed graphical model considered in this work
  • Figure S1: a: Distribution of activity values. The x-axis represents the activity values, and the y-axis represents the frequency of sequences within that activity value range. b: Trend of activity values. The x-axis represents the 1,000 selected samples, and the y-axis represents the normalized activity values. In this figure, red represents sequences labeled as non-viable in the Dyno published dataset, while purple represents sequences labeled as viable.
  • ...and 4 more figures
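
Figure 3d plots, for each single-mutant sequence, its frequency among sequenced viruses against its frequency among sequenced plasmids. The captions above do not give the formula behind the "enrichment level", so the following is only a minimal sketch of one common way to turn such counts into a score, a log2 ratio of virus frequency to plasmid frequency. The function name, the pseudocount, and the toy variant names are illustrative assumptions, not the authors' pipeline.

```python
import math


def log2_enrichment(plasmid_counts, virus_counts, pseudocount=1.0):
    """Return log2(virus frequency / plasmid frequency) per variant.

    Assumption: this log-ratio definition is a common convention for
    viability screens; it is not taken from the paper. The pseudocount
    is added to every count so that variants missing from one library
    do not cause a division by zero or log of zero.
    """
    variants = set(plasmid_counts) | set(virus_counts)
    n = len(variants)
    plasmid_total = sum(plasmid_counts.values()) + pseudocount * n
    virus_total = sum(virus_counts.values()) + pseudocount * n

    scores = {}
    for variant in variants:
        plasmid_freq = (plasmid_counts.get(variant, 0) + pseudocount) / plasmid_total
        virus_freq = (virus_counts.get(variant, 0) + pseudocount) / virus_total
        scores[variant] = math.log2(virus_freq / plasmid_freq)
    return scores


if __name__ == "__main__":
    # Hypothetical read counts for two made-up variants.
    plasmid = {"variant_A": 100, "variant_B": 100}
    virus = {"variant_A": 300, "variant_B": 100}
    for name, score in sorted(log2_enrichment(plasmid, virus).items()):
        print(f"{name}: {score:+.3f}")
```

Under this convention, a positive score means a variant is over-represented in the packaged virus pool relative to the plasmid pool, which is the usual proxy for a viable capsid. A negative score suggests the variant packages poorly.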