Limits on Inferring T-cell Specificity from Partial Information

James Henderson; Yuta Nagano; Martina Milighetti; Andreas Tiffeau-Mayer

TL;DR

This work introduces a top-down, coincidence-based information-theoretic framework to bound how well TCR antigen specificity can be inferred from partial sequence information. By defining features of TCR sequences and measuring their informativeness via coincidence entropy $H_2$ and coincidence mutual information $I_2$, the authors quantify how much each region (notably the $\beta$ chain and CDR3 segments) and even physical properties contribute to specificity, while revealing pervasive synergy and some redundancy across features. They derive exact Bayes-based bounds on classification performance with partial information, demonstrate these bounds on real TCR data (including SARS-CoV-2 epitopes), and show that a mixture-of-motifs model can explain epitope-specific variability in feature relevancy and interaction information. The framework also extends to fuzzy matching, enabling principled assessment of near-coincidence classification, which has direct implications for designing efficient sequencing strategies and developing interpretable ML models for TCR specificity prediction and therapeutic optimization. Overall, the work provides rigorous benchmarks and a versatile toolkit for understanding and leveraging partial TCR information in immunodiagnostics and cell therapies.

Abstract

A key challenge in molecular biology is to decipher the mapping of protein sequence to function. To perform this mapping requires the identification of sequence features most informative about function. Here, we quantify the amount of information (in bits) that T-cell receptor (TCR) sequence features provide about antigen specificity. We identify informative features by their degree of conservation among antigen-specific receptors relative to null expectations. We find that TCR specificity synergistically depends on the hypervariable regions of both receptor chains, with a degree of synergy that strongly depends on the ligand. Using a coincidence-based approach to measuring information enables us to directly bound the accuracy with which TCR specificity can be predicted from partial matches to reference sequences. We anticipate that our statistical framework will be of use for developing machine learning models for TCR specificity prediction and for optimizing TCRs for cell therapies. The proposed coincidence-based information measures might find further applications in bounding the performance of pairwise classifiers in other fields.
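
The coincidence-based measurement described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names and the toy V-gene labels are hypothetical, and the per-set entropy difference shown here is a simplification of the averaged coincidence mutual information $I_2$ used in the paper. The sketch assumes the probability of coincidence is estimated as the fraction of distinct sequence pairs that match exactly in a feature, and that the coincidence entropy is $H_2 = -\log_2 p_C$.

```python
from collections import Counter
from math import log2


def coincidence_probability(features):
    """Fraction of distinct unordered pairs whose feature values match exactly."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two sequences to form a pair")
    counts = Counter(features)
    # A value seen c times contributes c choose 2 matching pairs.
    matching_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return matching_pairs / total_pairs


def coincidence_entropy(features):
    """Coincidence entropy in bits: H_2 = -log2(p_C)."""
    p = coincidence_probability(features)
    if p == 0:
        return float("inf")  # no coincidences observed in this sample
    return -log2(p)


def entropy_reduction(background, specific):
    """Drop in coincidence entropy from background to an epitope-specific set."""
    return coincidence_entropy(background) - coincidence_entropy(specific)


# Toy data, invented for illustration only.
background_v = ["V1", "V2", "V3", "V4"] * 2   # diverse background repertoire
specific_v = ["V1", "V1", "V1", "V2"]         # conserved among specific TCRs

print(round(entropy_reduction(background_v, specific_v), 2))  # about 1.81 bits
```

A feature that is strongly conserved among epitope-specific receptors yields many coincidences, hence a low entropy and a large reduction relative to background. Counting pairs avoids estimating the full feature distribution, which is the practical appeal of a coincidence-based estimator on sparse data.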

Paper Structure (33 sections, 80 equations, 15 figures, 2 tables)

An information-theoretic approach to T cell specificity
Coincidence analysis for features
Coincidence entropy
Coincidence mutual information
Describing the interactions between features with redundancy and synergy
Bounding classification accuracy of partial TCR matches
Pairwise classification odds
When is partial information sufficient?
Application of the methodology to TCR sequence data
A decomposition of TCR specificity into its component parts
CDR3 length, net charge and glycine content as features
Synergy and redundancy between TCR features
Variability in interaction information across epitopes is explained by mixture models
Distance metrics and near-coincidence entropy
Generalization of coincidence mutual information to fuzzy matches
...and 18 more sections

Figures (15)

Figure 1: Overview of analysis methodology. a) Sketch of T-cell receptor structure highlighting the V, CDR3 and J regions and their interaction with MHC-bound peptides. The TCR is composed of two chains, most commonly $\alpha$ and $\beta$ chains. Each chain in turn is composed of a V (variable), J (joining) and C (constant) gene, with the addition of a D (diversity) gene in the $\beta$ chain. Within each chain, the CDR1 and CDR2 amino acid loops are encoded by the V gene, while the CDR3 regions lie at the V(D)J junction, which is additionally diversified through the random insertion and deletion of nucleotides at gene template junctions. b) An abstracted view of TCR sequence space. The set $B$ includes all possible TCRs. The subsets $S_i$ represent TCRs specific to particular ligands. c) Sequencing TCRs from either the whole repertoire or epitope-specific subsets gives us samples from their respective distributions. d) The number of pairs that match in a particular feature may then be recorded to compute a probability of coincidence. The logarithm of the probability of coincidence gives a measure of the entropy of the feature. Our information-theoretic approach quantifies, for different features (top to bottom), the change in entropy between background TCRs and sets of specific TCRs. Features that experience a large reduction in entropy (bottom) are the most informative for predicting the epitope specificity of a sequence.
Figure 2: Coincidence mutual information between TCR sections and antigen specificity. Relevancy scores of various sections of the T-cell receptor sequence. The off-diagonal values indicate the amount of coincidence information that combinations of features provide. The top right-hand grid shows the relevancy of combinations of features where one is from the $\alpha$ chain and the other from the $\beta$ chain. Interaction information and conditional mutual information between features can be computed by taking the difference between the off-diagonals and the sum of the corresponding diagonal values. In particular, positive interaction information is observed between the $\alpha$ and $\beta$ chains and between the CDR3 and V regions, indicating synergy between these features, while negative interaction information is seen between the CDR3 and J regions, indicating redundancy.
Figure 3: Coincidence mutual information between physical properties of the TCR sequence and antigen specificity. Relevancy scores of CDR3 length, CDR3 net charge and glycine content computed for the $\alpha$ and $\beta$ chains taken independently and combined. Although each feature has modest relevancy when considered independently, these features all display substantial synergy demonstrating how physical complementarity underlies overall chain pairing constraints.
Figure 4: Synergistic TCR sequence features. Interaction information scores for combinations of features computed from Figures 2 and 3. Positive interaction information indicates that two features become more informative in the context of one another and hence have synergy.
Figure 5: Correlation between $\alpha$-$\beta$ interaction information and per-chain information across epitopes. Local interaction information and single chain information across epitopes. Weighted linear fits (solid lines) obtained using orthogonal distance regression were used to quantify the dependence between variables, with regression slopes $a$ displayed above each panel. Epitope-specific interaction information depends negatively on the local informational value of the a) $\alpha$ chain and b) $\beta$ chain. We furthermore find that the c) per-chain relevancies are positively correlated with each other, as is d, e) total information with both single chain relevancies. The observed dependencies between variables agree well with theoretical expectations from a mixture model (dashed lines), in which epitopes differ in the number of distinct binding solutions or contain false positives.
...and 10 more figures
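
The rule stated in the Figure 2 caption, that interaction information is the off-diagonal (joint) relevancy minus the sum of the corresponding diagonal (single-feature) relevancies, can be written as a small helper. This is a hedged sketch: the function names, the dictionary layout, and all numerical values are hypothetical and are not taken from the paper's figures.

```python
def interaction_information(relevancy, a, b):
    """Joint relevancy of (a, b) minus the sum of the single-feature relevancies.

    `relevancy` maps tuples of feature names to information in bits:
    (a,) and (b,) play the role of diagonal entries, (a, b) the off-diagonal.
    """
    return relevancy[(a, b)] - (relevancy[(a,)] + relevancy[(b,)])


def describe(value):
    """Label the sign of an interaction information value."""
    if value > 0:
        return "synergy"
    if value < 0:
        return "redundancy"
    return "additive"


# Made-up relevancy values in bits, for illustration only.
toy_relevancy = {
    ("alpha",): 4.0,
    ("beta",): 5.0,
    ("alpha", "beta"): 11.0,
    ("cdr3",): 6.0,
    ("j",): 1.0,
    ("cdr3", "j"): 6.5,
}

for pair in [("alpha", "beta"), ("cdr3", "j")]:
    value = interaction_information(toy_relevancy, *pair)
    print(pair, value, describe(value))
```

With these toy numbers the first pair carries more information jointly than the two features do separately (synergy), while the second pair carries less (redundancy), mirroring the qualitative pattern the captions describe.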
