Limits on Inferring T-cell Specificity from Partial Information
James Henderson, Yuta Nagano, Martina Milighetti, Andreas Tiffeau-Mayer
TL;DR
This work introduces a top-down, coincidence-based information-theoretic framework that bounds how well TCR antigen specificity can be inferred from partial sequence information. The authors define features of TCR sequences and measure their informativeness using the coincidence entropy $H_2$ and the coincidence mutual information $I_2$. This quantifies how much each region (notably the $\beta$ chain and the CDR3 segments), as well as physical properties, contributes to specificity, and reveals pervasive synergy and some redundancy across features. They derive exact Bayes-based bounds on classification performance with partial information, demonstrate these bounds on real TCR data (including SARS-CoV-2 epitopes), and show that a mixture-of-motifs model can explain epitope-specific variability in feature relevance and interaction information. The framework also extends to fuzzy matching, enabling a principled assessment of near-coincidence classification, with direct implications for designing efficient sequencing strategies and for developing interpretable ML models for TCR specificity prediction and therapeutic optimization. Overall, the work provides rigorous benchmarks and a versatile toolkit for understanding and exploiting partial TCR information in immunodiagnostics and cell therapies.
Abstract
A key challenge in molecular biology is to decipher the mapping of protein sequence to function. Performing this mapping requires identifying the sequence features that are most informative about function. Here, we quantify the amount of information (in bits) that T-cell receptor (TCR) sequence features provide about antigen specificity. We identify informative features by their degree of conservation among antigen-specific receptors relative to null expectations. We find that TCR specificity depends synergistically on the hypervariable regions of both receptor chains, with a degree of synergy that depends strongly on the ligand. Using a coincidence-based approach to measuring information enables us to directly bound the accuracy with which TCR specificity can be predicted from partial matches to reference sequences. We anticipate that our statistical framework will be of use for developing machine learning models for TCR specificity prediction and for optimizing TCRs for cell therapies. The proposed coincidence-based information measures might find further applications in bounding the performance of pairwise classifiers in other fields.
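
As a minimal sketch of the coincidence-based idea, the Python snippet below estimates a coincidence probability $p_C$ (the chance that two distinct receptors share the same feature value) and the corresponding coincidence entropy $H_2 = -\log_2 p_C$, then compares antigen-specific receptors with a background set. The function names, the toy feature values, and the reading of "information" as the difference between the two entropies are assumptions made for illustration. They are not necessarily the authors' exact estimator.

```python
from collections import Counter
from math import log2


def coincidence_probability(items):
    """Estimate the probability that two distinct draws from `items` match.

    Counts ordered pairs of distinct positions with identical values and
    divides by the total number of ordered pairs, n * (n - 1).
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    counts = Counter(items)
    matching_pairs = sum(c * (c - 1) for c in counts.values())
    return matching_pairs / (n * (n - 1))


def coincidence_entropy(items):
    """Return H_2 = -log2(p_C) in bits (infinite if no coincidence is seen)."""
    p = coincidence_probability(items)
    return float("inf") if p == 0 else -log2(p)


# Hypothetical toy feature values (for example, a short CDR3 fragment).
# These strings are invented for illustration and are not real data.
background = ["CASSL", "CASSF", "CASRG", "CSARD",
              "CASSL", "CAWSV", "CASSP", "CSVGT"]
specific = ["CASSL", "CASSL", "CASSF", "CASSL"]

# Assumed reading: how many bits more conserved the feature is among
# antigen-specific receptors than in the background set.
info_bits = coincidence_entropy(background) - coincidence_entropy(specific)
print(f"H2(background) = {coincidence_entropy(background):.3f} bits")
print(f"H2(specific)   = {coincidence_entropy(specific):.3f} bits")
print(f"difference     = {info_bits:.3f} bits")
```

A larger difference means the feature is more conserved among the antigen-specific receptors than the background would suggest. With real data, the background would typically be a large unselected repertoire, and small sample sizes make the estimate noisy.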
