From Words to Amino Acids: Does the Curse of Depth Persist?

Aleena Siji, Amir Mohammad Karimi Mamaghan, Ferdinand Kapl, Tobias Höppe, Emmanouil Angelis, Andrea Dittadi, Maurice Brenner, Michael Heinzinger, Karl Henrik Johansson, Kaitlin Maile, Johannes von Oswald, Stefan Bauer

TL;DR

We present a depth analysis of six popular PLMs spanning three training objectives. Across models, we observe consistent depth-dependent patterns suggesting that PLMs exhibit a form of depth inefficiency, which motivates future work on more depth-efficient architectures and training methods.

Abstract

Protein language models (PLMs) have become widely adopted as general-purpose models, demonstrating strong performance in protein engineering and de novo design. Like large language models (LLMs), they are typically trained as deep transformers with next-token or masked-token prediction objectives on massive sequence corpora and are scaled by increasing model depth. Recent work on autoregressive LLMs has identified the Curse of Depth: later layers contribute little to the final output predictions. These findings naturally raise the question of whether a similar depth inefficiency also appears in PLMs, where many widely used models are not autoregressive, and some are multimodal, accepting both protein sequence and structure as input. In this work, we present a depth analysis of six popular PLMs across model families and scales, spanning three training objectives, namely autoregressive, masked, and diffusion, and quantify how layer contributions evolve with depth using a unified set of probing- and perturbation-based measurements. Across all models, we observe consistent depth-dependent patterns that extend prior findings on LLMs: later layers depend less on earlier computations and mainly refine the final output distribution, and these effects are increasingly pronounced in deeper models. Taken together, our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.

Paper Structure (38 sections, 36 figures)


Figures (36)

  • Figure 1: Maximum propagated effect of skipping each layer on future-token computations in ESM2. Here, “future” refers to a held-out subset of non-intervened masked tokens. Even at 35M, skipping later layers produces relatively weak propagated effects compared to skipping early layers. From 150M onward, this separation becomes clear: a substantial fraction of late layers can be skipped with only minor changes in subsequent computations on future tokens. This pattern strengthens with scale, indicating that downstream sensitivity increasingly concentrates in early-to-middle layers, while later layers mainly refine the final prediction. We also observe localized low-effect regions among early layers, suggesting that not all early layers contribute equally. This aligns with the stage-wise view of Lad et al. (2024) and suggests that depth is organized into multiple inference stages with weaker dependencies across certain layer ranges.
  • Figure 2: Maximum change in ESM2 output probabilities under layer skipping, restricted to future tokens only. Here, “future tokens” refers to a held-out subset of non-intervened masked tokens. The effect generally decreases with depth, indicating that later layers tend to induce smaller output changes and mainly provide incremental refinement. We also observe localized low-effect regions among early layers, consistent with the stage-like patterns seen in Figure 1.
  • Figure 3: KL divergence between the LogitLens layer-wise output distribution and the final output distribution for ESM2, plotted across depth. From 150M onward, the KL divergence decreases steadily toward later layers, indicating that deeper layers make increasingly incremental updates that bring the distribution closer to the final prediction. For the largest variants, the KL divergence is already relatively low in earlier layers, making the late-layer refinement phase less sharply separated. Overall, the trend remains consistent with later layers primarily refining an increasingly stable prediction.
  • Figure 4: Top-1 overlap between the layer-wise prediction and the full-model prediction for ESM2 across depth. Across ESM2 variants, agreement is low in early layers and increases toward the end of the network. In larger models, it rises gradually through mid-depth and then sharply in the final layers.
  • Figure 5: Layer-wise ProteinGym performance for ESM2. Average Spearman correlation as a function of relative depth, normalized to $[0,1]$, where predictions are taken from each layer. Performance improves with depth for all model sizes, but the largest models exhibit diminishing returns in the final layers, suggesting that earlier layers already capture much of the signal and later layers mainly provide small refinements to the final predictions.
  • ...and 31 more figures
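
The measurements named in the captions above (layer skipping, LogitLens KL divergence, and top-1 overlap) can be illustrated with a minimal sketch on a toy residual stack. This is not the authors' code and does not touch ESM2: `make_layer`, `forward`, the layer scales, and the inputs are hypothetical, the hidden state doubles as the logits (an identity unembedding), the restriction to held-out "future" tokens is omitted, and the KL direction KL(final || layer) is an assumption, since the caption does not state it. The toy layers are deliberately built with shrinking update sizes, so the output only demonstrates how the metrics are computed, not evidence for the paper's findings.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    # KL(p || q); eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def make_layer(scale, target):
    # Hypothetical toy layer: a residual update that pulls h toward `target`.
    return lambda h: [scale * (t - x) for x, t in zip(h, target)]

def forward(layers, h0, skip=None):
    # Returns the hidden state after every layer; layer `skip` acts as identity.
    h, states = list(h0), []
    for i, layer in enumerate(layers):
        if i != skip:
            h = [x + u for x, u in zip(h, layer(h))]
        states.append(h)
    return states

def logit_lens_metrics(layers, inputs):
    # Per-layer mean KL(final || layer) and top-1 agreement with the final output.
    n = len(layers)
    kls, overlaps = [0.0] * n, [0.0] * n
    for h0 in inputs:
        states = forward(layers, h0)
        final = softmax(states[-1])
        for i, h in enumerate(states):
            kls[i] += kl(final, softmax(h)) / len(inputs)
            overlaps[i] += (argmax(h) == argmax(states[-1])) / len(inputs)
    return kls, overlaps

def skip_effects(layers, inputs):
    # Max absolute change in any output probability when each layer is skipped.
    effects = []
    for i in range(len(layers)):
        worst = 0.0
        for h0 in inputs:
            full = softmax(forward(layers, h0)[-1])
            ablated = softmax(forward(layers, h0, skip=i)[-1])
            worst = max(worst, max(abs(a - b) for a, b in zip(full, ablated)))
        effects.append(worst)
    return effects

# Toy setup: four layers with decreasing update sizes, three input positions.
target = [2.0, 0.0, -1.0, 0.5]
layers = [make_layer(s, target) for s in (0.6, 0.4, 0.2, 0.1)]
inputs = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.5, 0.0], [1.0, 0.0, 0.0, 0.0]]

kls, overlaps = logit_lens_metrics(layers, inputs)
effects = skip_effects(layers, inputs)
print("KL to final per layer:", [round(k, 4) for k in kls])
print("top-1 overlap per layer:", [round(o, 2) for o in overlaps])
print("max prob change when skipped:", [round(e, 4) for e in effects])
```

On a real PLM, the same three quantities would be computed from per-layer hidden states passed through the model's own output head, averaged over many sequences and masked positions.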