Transfer Learning for T-Cell Response Prediction

Josua Stadelmaier, Brandon Malone, Ralf Eggeling

TL;DR

This work tackles the challenge of predicting T-cell responses to peptides in a multi-domain setting, where data come from diverse peptide sources and MHC alleles and models are therefore at risk of shortcut learning. It introduces a domain-aware evaluation framework and explores transfer-learning strategies on a transformer-based predictor, including adversarial domain adaptation (ADA-T) and per-source fine-tuning (FINE-T). ADA-T reduces domain-specific shortcuts but does not consistently improve accuracy, while FINE-T yields robust gains, particularly for MHC I, and is competitive with existing state-of-the-art approaches on human peptides. Overall, the study highlights the importance of accounting for data heterogeneity in immunogenicity prediction and points to FINE-T as a practical approach for personalized cancer vaccine design, while calling for standardized benchmarks to enable fair comparisons.

Abstract

We study the prediction of T-cell response for specific given peptides, which could, among other applications, be a crucial step towards the development of personalized cancer vaccines. It is a challenging task due to limited, heterogeneous training data featuring a multi-domain structure; such data entail the danger of shortcut learning, where models learn general characteristics of peptide sources, such as the source organism, rather than specific peptide characteristics associated with T-cell response. Using a transformer model for T-cell response prediction, we show that the danger of inflated predictive performance is not merely theoretical but occurs in practice. Consequently, we propose a domain-aware evaluation scheme. We then study different transfer learning techniques to deal with the multi-domain structure and shortcut learning. We demonstrate that a per-source fine-tuning approach is effective across a wide range of peptide sources and further show that our final model is competitive with existing state-of-the-art approaches for predicting T-cell responses for human peptides.

Paper Structure (16 sections, 4 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Domain structure of T-cell response data. (a) Distribution of T-cell response positives and negatives per peptide source. (b) Same plot for MHC alleles. (c) Clusters (indexed with A-G) of similar peptides (left) and the sources of peptides the clusters consist of (right). Numbers show peptide counts. (d) Sequence logos for the MHC alleles HLA-A*02:01 (first row) and HLA-A*11:01 (second row). The columns represent T-cell response positives (left) and negatives (right).
  • Figure 2: Model architecture for T-cell response prediction. Shading of boxes indicates with which objectives the components are trained. Boxes with dashed borders are only used in the adversarial domain adaptation setting.
  • Figure 3: Shortcut learning and the effect of adversarial domain adaptation. The left column shows results for BASE-T and the right column shows corresponding results for ADA-T with adversarial domain adaptation being applied on peptide sources. (a) Model performance on validation data with different settings of accounting for shortcuts in the evaluation. For the "allele adjusted" performance, peptides are grouped by MHC alleles in the evaluation to detect shortcuts based on MHC alleles. "Source adjusted" is analogous for grouping peptides by their source. (b) Distribution of prediction scores, separated by the T-cell response labels. For each label, two distributions are shown that correspond to the two most frequent peptide sources, Human betaherpesvirus 6B (majority of labels is positive) and Vaccinia virus (majority of labels is negative). The markers on the x-axis show the mean prediction scores of the two distributions. (c) t-SNE visualization of the latent peptide representations $\mathbf{h}$.
  • Figure 4: Validation performance of BASE-T and FINE-T models for several peptide sources. Results for MHC I are shown in (a) and for MHC II in (b). For each MHC class, the five most frequent peptide sources are selected. Peptide sources with only positives or only negatives in one of the test data partitions are excluded.
  • Figure 5: Test performance on human peptides presented on (a) MHC class I and (b) MHC class II. For both MHC classes, the first two rows correspond to existing models from the literature, which are trained on other data sets. Bag-of-AA baseline and FINE-T are trained on the same data. Blue bars correspond to the evaluation on the previously completely unused test set. Grey bars show the mean AUC from the final nested cross validation.
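
The "source adjusted" and "allele adjusted" evaluations described in the caption of Figure 3 group peptides by domain before scoring, so that a model cannot earn credit merely for telling domains apart. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that AUC is computed separately within each group, that groups containing only one class are skipped (as the caption of Figure 4 describes for peptide sources), and that the per-group values are combined by an unweighted mean. The aggregation rule, the function names `auc` and `group_adjusted_auc`, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def group_adjusted_auc(labels, scores, groups):
    """Compute AUC within each group (e.g. peptide source or MHC allele)
    and return (unweighted mean, per-group dict).

    Assumption: single-class groups are skipped and the remaining
    groups are averaged with equal weight."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)

    per_group = {}
    for g, (ys, ss) in by_group.items():
        if len(set(ys)) < 2:
            continue  # AUC is undefined for a single-class group
        per_group[g] = auc(ys, ss)

    if not per_group:
        raise ValueError("no group contains both classes")
    mean_auc = sum(per_group.values()) / len(per_group)
    return mean_auc, per_group


# Hypothetical toy data: source "A" is mostly positive, source "B" mostly
# negative, and the scores track the source rather than the label.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.85, 0.2, 0.1, 0.3, 0.15]
sources = ["A"] * 4 + ["B"] * 4

pooled = auc(labels, scores)
adjusted, per_source = group_adjusted_auc(labels, scores, sources)
print(f"pooled AUC:          {pooled:.4f}")
print(f"source-adjusted AUC: {adjusted:.4f}")
```

On this toy data the pooled AUC is above 0.5 while the source-adjusted AUC falls below it, which is the kind of gap that signals a shortcut: the scores separate the two sources, not the responders from the non-responders within a source.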