Contrastive learning of T cell receptor representations

Yuta Nagano; Andrew Pyo; Martina Milighetti; James Henderson; John Shawe-Taylor; Benny Chain; Andreas Tiffeau-Mayer

TL;DR

This paper introduces SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), a TCR language model capable of data-efficient transfer learning. Its pre-training strategy combines autocontrastive learning with masked-language modelling, which enables SCEPTR to achieve state-of-the-art performance.

Abstract

Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labelled TCR data remains sparse. In other domains, the pre-training of language models on unlabelled data has been successfully used to address data bottlenecks. However, it is unclear how to best pre-train protein language models for TCR specificity prediction. Here we introduce a TCR language model called SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), capable of data-efficient transfer learning. Through our model, we introduce a novel pre-training strategy combining autocontrastive learning and masked-language modelling, which enables SCEPTR to achieve its state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm to decode the rules of TCR specificity.
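The autocontrastive idea mentioned in the abstract can be illustrated with a minimal sketch. It follows the description in the Figure 2 caption: two independent views of the same TCR are produced by removing a proportion of tokens and sometimes dropping a whole chain, and the embeddings of the two views are then pulled together relative to the other TCRs in the batch. Everything concrete here is an assumption made for illustration: the loop names, the drop probabilities, the temperature, and the InfoNCE-style form of the loss are not taken from the paper, and the encoder that would map views to embeddings is left out.

```python
import math
import random

# Hypothetical keys for the six CDR loops (three per chain).
ALPHA_LOOPS = ("CDR1A", "CDR2A", "CDR3A")
BETA_LOOPS = ("CDR1B", "CDR2B", "CDR3B")


def make_view(tcr, rng, token_drop=0.2, chain_drop=0.5):
    """Return a random partial view of a TCR given as {loop name: amino acid string}.

    token_drop and chain_drop are illustrative values, not the paper's settings.
    """
    loops = list(ALPHA_LOOPS + BETA_LOOPS)
    if rng.random() < chain_drop:
        # Sometimes keep only one chain, chosen at random.
        loops = list(rng.choice((ALPHA_LOOPS, BETA_LOOPS)))
    tokens = [(loop, i, aa) for loop in loops for i, aa in enumerate(tcr[loop])]
    kept = [t for t in tokens if rng.random() >= token_drop]
    # Keep at least one token so the view is never empty.
    return kept or [rng.choice(tokens)]


def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(views_a, views_b, temperature=0.1):
    """Mean InfoNCE-style loss over a batch of embeddings.

    views_a[i] and views_b[i] are embeddings of two views of the same TCR;
    every other entry of views_b acts as a negative for views_a[i].
    """
    a = [l2_normalise(v) for v in views_a]
    b = [l2_normalise(v) for v in views_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

In a training loop one would call `make_view` twice per TCR, embed both views with the model, and minimise `info_nce_loss` over the batch. The loss is small when each view is closest to its own partner and large when it is closer to another TCR's view.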

Paper Structure (22 sections, 4 equations, 23 figures, 2 tables)

Results
Benchmarking PLM embeddings on TCR specificity prediction
Autocontrastive learning as a pre-training strategy
Ablation studies
Comparison of SCEPTR embeddings to alignment-based TCR similarity
Supervised contrastive learning as a fine-tuning strategy
Discussion
Methods
Model benchmarking
SCEPTR architecture
SCEPTR Pre-training
Data
Procedure
SCEPTR fine-tuning with supervised contrastive learning
Data
...and 7 more sections

Figures (23)

Figure 1: Benchmarking TCR language models against sequence alignment-based approaches on few-shot TCR specificity prediction. a) TCR similarity can be quantified using sequence-alignment by taking a (weighted) count of how many sequence edits turn one TCR into another. b) Learned sequence representations allow alignment-free sequence comparisons based on distances in the embedding feature space. c) Sketch of our standardized benchmarking approach to allow side-by-side comparison of sequence-alignment and embedding methods. Using a reference set of known TCR binders to a pMHC of interest, we propose nearest neighbour prediction as a task for unbiased comparison of the quality of embeddings for specificity prediction. d) Performance of six different models on TCR specificity prediction as a function of the number of reference TCRs. Specificity predictions were made by the nearest neighbour method sketched in c against six different pMHCs and performance is reported as the AUROC averaged across the pMHCs. The error bars represent standard deviations of model AUROCs relative to the average across all models within a data split.
Figure 2: A visual introduction to how SCEPTR works. a) SCEPTR featurises an input TCR as the amino acid sequences of its six CDR loops. Each amino acid residue is vectorised to $\mathbb{R}^{64}$ (see panel b) and passed, along with the special <cls> token vector, through a stack of three self-attention layers. SCEPTR uses the contextualised embedding of the <cls> token as the overall TCR representation, in contrast to the average-pooling representations used by other models. b) SCEPTR's initial token embedding module uses a simple one-hot system to encode a token's amino acid identity and CDR loop number, and allocates one dimension to encode the token's relative position within its CDR loop as a single real-valued scalar. c) Contrastive learning allows us to explicitly optimise SCEPTR's representation mapping for TCR co-specificity prediction. At a high level, contrastive learning encourages representation models to make full use of the available representation space while keeping representations of similar input samples close together. d) Contrastive learning generalises to both the supervised and unsupervised settings. In the supervised setting, positive pairs can be generated by sampling pairs of TCRs that are known to bind the same pMHC. In the unsupervised setting, positive pairs can be generated by generating two independent "views" of the same TCR. We implement this by only showing a random subset of the input data features for every view -- namely, we remove a proportion of input tokens and sometimes drop the $\alpha$ or $\beta$ chain entirely (see Methods, SCEPTR Pre-training).
Figure 3: Autocontrastive pre-training significantly improves SCEPTR's downstream performance. The subplots show performance profiles of SCEPTR, TCRdist, TCR-BERT, and various ablation variants of SCEPTR on binary specificity prediction. a) Training SCEPTR solely on MLM results in worse specificity prediction performance. b) The baseline SCEPTR variant which uses the <cls> pooling method performs marginally better than the variant which uses the average-pooling method. However, the average-pooling variant still performs on par with TCRdist. c) Replacing SCEPTR's pre-training dataset with 1) the same dataset from Tanno et al., but with $\alpha/\beta$ chain pairing shuffled, and 2) synthetic data generated by OLGA both result in similar specificity prediction performance. d) Restricting SCEPTR's featurisation of input TCRs to the amino acids of the $\alpha$ and $\beta$ CDR3 loops significantly worsens downstream performance. Additionally restricting training to only MLM further degrades performance, and produces a model with a near-equivalent performance profile to TCR-BERT.
Figure 4: SCEPTR embedding distances weight sequence similarity with respect to recombination biases. a) Scatter plot of SCEPTR and TCRdist distances between pairs of TCRs from the held-out test set of the pre-training dataset. The points are coloured according to a Gaussian kernel density estimate. b) Colouring TCR pairs instead by the minimal probability of generation $p_\textrm{gen}$ of the two TCRs as estimated by OLGA (Sethna et al., 2019) suggests that SCEPTR embeddings locally contract regions of representation space that, due to recombination biases, are sparsely sampled. c) For sequence pairs judged to be similar by SCEPTR (distance $\in [0.98, 1.02]$), variations in $p_\textrm{gen}$ explain a substantial fraction of the variance in TCRdist, providing statistical evidence for the hypothesized weighting of sequence similarity with respect to the local density of sequences produced by VDJ recombination (see Fig. \ref{fig:pgen_vs_tcrdist} for the generality of this dependence across SCEPTR bins).
Figure 5: Supervised contrastive learning improves discrimination between pMHCs. Prediction performance as measured by AUROC on binary one-versus-rest classification for each of six pMHCs for different models. The fine-tuned model improves performance by exploiting the discriminative nature of the classification task.
...and 18 more figures
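The nearest neighbour benchmark described in the Figure 1 caption can be sketched in a few lines. Each query TCR is scored by how close it lies to its nearest neighbour in a reference set of known binders, and the scores are summarised with the AUROC. This is only a rough illustration under stated assumptions: the function names are hypothetical, Euclidean distance on toy 2-D vectors stands in for whichever embedding or alignment-based distance is being evaluated, and the data are invented.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbour_scores(queries, references, distance=euclidean):
    """Score each query by closeness to its nearest reference binder.

    Scores are negated distances, so higher means more similar.
    """
    return [-min(distance(q, r) for r in references) for q in queries]


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy 2-D "embeddings", invented for illustration only.
references = [[0.0, 0.0], [1.0, 0.0]]
queries = [[0.1, 0.0], [0.9, 0.1], [3.0, 3.0], [0.5, 2.0]]
labels = [1, 1, 0, 0]  # 1 = binds the pMHC of interest

scores = nearest_neighbour_scores(queries, references)
print(auroc(scores, labels))  # prints 1.0 for this toy data
```

Because the scoring function accepts any `distance` callable, the same routine can compare an embedding-based distance and a sequence-alignment distance on equal terms, which is the point of the standardized benchmark.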
