Physicochemically Informed Dual-Conditioned Generative Model of T-Cell Receptor Variable Regions for Cellular Therapy

Jiahao Ma, Hongzong Li, Ye-Fan Hu, Jian-Dong Huang

TL;DR

PhysicoGPTCR tackles the problem of generating TCR variable regions that are novel, diverse, and biophysically plausible within a given peptide–MHC context. It introduces a dual-conditioned Transformer that fuses peptide and HLA inputs with residue-level physicochemical embeddings to model $p_\theta(t \mid m, p)$ in an end-to-end fashion. Across multiple benchmarks against baselines, it achieves superior string-based metrics and shows a higher proportion of docking-competent clones, validated through in-silico analyses and case studies. This approach promises to dramatically shorten the TCR discovery timeline from months to minutes while maintaining downstream verifiability, enabling rapid, personalized cellular therapies.

Abstract

Physicochemically informed biological sequence generation has the potential to accelerate computer-aided cellular therapy, yet current models fail to jointly ensure novelty, diversity, and biophysical plausibility when designing variable regions of T-cell receptors (TCRs). We present PhysicoGPTCR, a large generative protein Transformer that is dual-conditioned on peptide and HLA context and trained to autoregressively synthesise TCR sequences while embedding residue-level physicochemical descriptors. The model is optimised on curated TCR-peptide-HLA triples with a maximum-likelihood objective and compared against ANN, GPTCR, LSTM, and VAE baselines. Across multiple neoantigen benchmarks, PhysicoGPTCR substantially improves edit-distance, similarity, and longest-common-subsequence scores, while populating a broader region of sequence space. Blind in-silico docking and structural modelling further reveal a higher proportion of binding-competent clones than the strongest baseline, validating the benefit of explicit context conditioning and physicochemical awareness. Experimental results demonstrate that dual-conditioned, physics-grounded generative modelling enables end-to-end design of functional TCR candidates, reducing the discovery timeline from months to minutes without sacrificing wet-lab verifiability.
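
As a sketch of what "autoregressive" training with a "maximum-likelihood objective" typically means here, the conditional density can be written with the standard left-to-right factorisation. The notation follows the TL;DR ($t$ the TCR sequence, $m$ the MHC/HLA context, $p$ the peptide); the sequence length $L$ and the training set $\mathcal{D}$ are symbols introduced for illustration, and the paper's exact loss may include further terms not stated in the abstract.

$$
p_\theta(t \mid m, p) = \prod_{i=1}^{L} p_\theta(t_i \mid t_{<i}, m, p),
\qquad
\mathcal{L}(\theta) = -\sum_{(t, m, p) \in \mathcal{D}} \sum_{i=1}^{L} \log p_\theta(t_i \mid t_{<i}, m, p).
$$

Minimising $\mathcal{L}(\theta)$ is equivalent to maximising the likelihood of the observed TCR sequences given their peptide and HLA context.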

Paper Structure

This paper contains 53 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Tasks similar to TCR generation and the workflow. (A) Protein generation analogies. Antibodies can be generated from antigen inputs for use in immunotherapy or as neutralizing antibodies. Enzymes can be generated for distinct substrates to improve biomanufacturing. TCR generation is analogous to these two tasks: given peptide-MHC inputs, TCRs can be generated for cellular therapies. (B) PhysicoGPTCR workflow: the model takes peptide sequences and MHC pseudo-sequences as inputs, leverages sequence motifs and physicochemical features, and passes them to a decoder that outputs TCR CDR3 sequences.
  • Figure 2: Model overview. Three information channels (token identity, positional index, and residue-level physicochemical descriptors) are fused by a gated projector and fed into a lightweight 2 + 2-layer Transformer that is conditioned on both peptide and HLA context.
  • Figure 3: Inference and post-processing for TCR generation. The pipeline consists of four steps: (1) Multi-start Generation: beam search produces 1,024 raw sequences. (2) Legality Filter: sequences are filtered by length (10-18 residues) and uniqueness. (3) Likelihood Scoring: retained candidates are scored and ranked by the negative length-normalised log-likelihood. (4) Diversity Selection: the top 20 sequences are chosen via maximal marginal relevance (MMR) to balance binding affinity and sequence diversity.
  • Figure 4: Comparison across three sequence-level metrics on the 6,200-sample test set (lower Levenshtein $\downarrow$ and higher Similarity/LCS $\uparrow$ indicate better performance).
  • Figure 5: Model performance across contexts. (A) Sequence-level metrics per MHC allele (mean $\pm$ standard deviation, $n \ge 150$ each). (B) Sequence-level metrics per epitope (mean $\pm$ standard deviation). Lower Levenshtein $\downarrow$ and higher Similarity/LCS $\uparrow$ indicate better performance. Red dashed lines show the averaged metrics of PhysicoGPTCR.
  • ...and 2 more figures
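
The post-processing steps named in the Figure 3 caption (legality filter, likelihood scoring, MMR selection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `lam` trade-off weight, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions, since the caption does not specify them. Per-token log-probabilities are assumed to come from the generative model.

```python
from difflib import SequenceMatcher


def legality_filter(seqs, min_len=10, max_len=18):
    """Keep unique sequences of 10-18 residues, preserving input order."""
    seen, kept = set(), []
    for s in seqs:
        if min_len <= len(s) <= max_len and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept


def length_normalised_nll(token_logprobs):
    """Negative mean per-token log-probability (lower means more likely)."""
    return -sum(token_logprobs) / len(token_logprobs)


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; the paper's measure is not stated."""
    return SequenceMatcher(None, a, b).ratio()


def mmr_select(candidates, relevance, k=20, lam=0.7):
    """Greedy maximal marginal relevance.

    relevance maps each sequence to a score where higher is better
    (for example, the negated length-normalised NLL). lam is a
    hypothetical weight: 1.0 ranks by relevance only, 0.0 by diversity only.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(s):
            # Redundancy is the similarity to the closest already-chosen sequence.
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance[s] - (1.0 - lam) * redundancy

        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    # Toy CDR3-like strings and made-up log-probabilities, for illustration only.
    raw = ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CAS", "CASSPGTGGYEQYF"]
    kept = legality_filter(raw)
    fake_logprobs = {s: [-0.5] * len(s) for s in kept}
    relevance = {s: -length_normalised_nll(lp) for s, lp in fake_logprobs.items()}
    print(mmr_select(kept, relevance, k=2))
```

Because the relevance term (a log-likelihood) and the redundancy term (a similarity in [0, 1]) live on different scales, a practical pipeline would likely rescale one of them before mixing; that choice is left open here.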