Layer Probing Improves Kinase Functional Prediction with Protein Language Models

Ajit Kumar, IndraPrakash Jha

TL;DR

Protein language models often rely on final-layer embeddings, potentially missing informative signals encoded in earlier layers. This work systematically probes all 33 layers of ESM-2 for kinase domain classification, revealing that mid-to-late layers (20-33) provide superior unsupervised clustering and supervised accuracy (0.354 ARI and 75.7% accuracy) when combined with mean pooling and domain-aware embeddings. A principled layer-averaging framework, along with domain extraction, Platt calibration, and a reproducible benchmarking pipeline, yields a practical approach that outperforms motif-based methods and single-layer baselines. The study demonstrates depth-specific biological signals in transformer models and offers a generalizable strategy for improving protein function prediction across PLMs, with available data and code to support adoption.

Abstract

Protein language models (PLMs) have transformed sequence-based protein analysis, yet most applications rely only on final-layer embeddings, which may overlook biologically meaningful information encoded in earlier layers. We systematically evaluate all 33 layers of ESM-2 for kinase functional prediction using both unsupervised clustering and supervised classification. We show that mid-to-late transformer layers (layers 20-33) outperform the final layer by 32 percent in unsupervised Adjusted Rand Index and improve homology-aware supervised accuracy to 75.7 percent. Domain-level extraction, calibrated probability estimates, and a reproducible benchmarking pipeline further strengthen reliability. Our results demonstrate that transformer depth contains functionally distinct biological signals and that principled layer selection significantly improves kinase function prediction.
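The layer-selection idea described above can be illustrated with a minimal sketch. It assumes per-layer hidden states are already available as an array of shape `(num_layers + 1, seq_len, dim)`, with index 0 holding the embedding-layer output (the convention used by common transformer libraries when hidden states are requested). The function name `layer_averaged_embedding`, the `mask` argument, and the order of operations (mean pooling over residues, then averaging over layers 20-33) are illustrative assumptions, not the authors' released pipeline.

```python
import numpy as np


def layer_averaged_embedding(hidden_states, mask, first_layer=20, last_layer=33):
    """Mean-pool residues within each selected layer, then average across layers.

    hidden_states: array of shape (num_layers + 1, seq_len, dim); index 0 is
        assumed to be the embedding output, index k the output of layer k.
    mask: boolean array of shape (seq_len,), True for real residue tokens
        (special and padding tokens set to False).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no residues")

    # Keep only the chosen layer range (inclusive on both ends).
    selected = hidden_states[first_layer:last_layer + 1]  # (L, seq_len, dim)

    # Mean pooling over residue positions, separately for each layer.
    per_layer = selected[:, mask, :].mean(axis=1)  # (L, dim)

    # Uniform average over the selected layers.
    return per_layer.mean(axis=0)  # (dim,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a 33-layer model: 34 states, 6 tokens, 4 dimensions.
    states = rng.normal(size=(34, 6, 4))
    residue_mask = np.array([False, True, True, True, True, False])
    print(layer_averaged_embedding(states, residue_mask).shape)
```

Restricting `mask` to the positions of an extracted kinase domain would give a domain-level embedding in the same way; the final-layer baseline corresponds to `first_layer=last_layer=33`.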

Paper Structure

This paper contains 28 sections, 9 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Clustering performance (ARI) across ESM-2 layer selection strategies.
  • Figure 2: Confusion matrix for supervised classification across 8 kinase functional classes. Mid-layer averaged embeddings show high recall for most classes.
  • Figure 3: Classification performance across different homology identity thresholds (70%, …)
  • Figure 4: Effect of pooling strategy on performance. Mean pooling consistently outperforms CLS token across both clustering and classification.
  • Figure 5: Calibration curves before and after Platt scaling. Platt scaling reduces overconfidence and improves calibration.
  • ...and 1 more figure
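
The Platt scaling mentioned in the Figure 5 caption can be sketched for the binary case as follows. This is a simplified illustration, not the paper's implementation: it fits the sigmoid parameters `a` and `b` by plain gradient descent on the log-loss (Platt's original procedure uses smoothed targets and a second-order optimizer), and the function names and toy scores are hypothetical. A multi-class classifier would typically apply this per class in a one-vs-rest fashion.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(scores, labels, a=1.0, b=0.0):
    """Mean negative log-likelihood of labels under sigmoid(a * score + b)."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(_sigmoid(a * s + b), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) on held-out scores by gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(score, a, b):
    """Calibrated probability for a raw classifier score."""
    return _sigmoid(a * score + b)


if __name__ == "__main__":
    # Hypothetical held-out scores: large magnitudes, but several are wrong,
    # which is the overconfident pattern calibration is meant to correct.
    scores = [4.0, 3.5, 3.0, 2.5, 2.0, -2.5, -3.0, -3.5]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    a, b = fit_platt(scores, labels)
    print("log-loss before:", log_loss(scores, labels))
    print("log-loss after: ", log_loss(scores, labels, a, b))
```

The parameters should be fitted on data not used to train the classifier; otherwise the calibration map inherits the classifier's overconfidence.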