Mechanisms of Non-Monotonic Scaling in Vision Transformers

Anantha Padmanaban Krishna Kumar

TL;DR

This work investigates why deeper Vision Transformers sometimes underperform shallower ones on ImageNet. It identifies a consistent three-phase Cliff-Plateau-Climb pattern in ViT representations and introduces the Information Scrambling Index ($ISI$) to quantify how information mixes across tokens with depth. The study finds that the best representation geometry is associated with progressive marginalization of the [CLS] token as a hub, in favor of distributed consensus among patch tokens. ViT-B reaches better Neural Collapse geometry at a shallower depth than ViT-L, which the study attributes to more controlled information flow. Collectively, the findings argue for depth-aware transformer design that emphasizes calibrated depth and clean phase transitions in information flow, with diagnostics such as $ISI$, the Attention Consensus Index ($ACI$), and CLS Centrality ($CCC$) guiding architecture choices.

Abstract

Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Layer-wise evolution of centered token similarity in Vision Transformers. Three distinct phases emerge consistently across model scales: initial decorrelation (Cliff, layers 0 and 1), extended low-similarity processing (Plateau, middle layers), and terminal re-correlation (Climb, the final few layers, approximately the last three to four). The Plateau duration scales with model depth while maintaining similar similarity ranges.
  • Figure 2: PE strength and model performance. ImageNet top-1 accuracy versus PE scaling factor $\alpha$. All models peak near $\alpha=1.0$, illustrating the functional importance of well-calibrated initial decorrelation.
  • Figure 3: Emergence of Neural Collapse in ViT-Base. Across the final layers, we observe sharp improvement in geometric optimality. NC1 (within-class variance) and NC2 (ETF gap) drop, while NC3 (classifier alignment) and NC4 (decision margin) rise, in a manner consistent with a rapid transition toward a collapsed state.
  • Figure 4: Information Plane Analysis. Panels (a), (b), and (c): all models trade off pre-PE patch structure (InfoX) against task signal but at different rates. ViT-B shows a sharp pivot around layer 8, whereas ViT-L changes more gradually up to about layer 18. Panels (d), (e), and (f): the Scrambling Index reveals their communication regimes. ViT-S exhibits communication collapse, ViT-B maintains controlled mixing, and ViT-L escalates into over-scrambling. Shaded regions mark pivot zones.
  • Figure 5: Coordinated reorganization in ViT-Base. During layers 8 to 10 (shaded), hub marginalization (CCC $\downarrow$) and consensus building (ACI $\uparrow$) coincide with improved geometric quality (NC2 $\downarrow$, inverted) and accuracy (NC4 $\uparrow$). Metrics normalized to the range $[0, 1]$.
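
The layer-wise "centered token similarity" plotted in Figure 1 can be illustrated with a small sketch. The captions do not give the exact formula, so the definition below is an assumption: subtract the mean token at a layer, then average the cosine similarity over all distinct token pairs. The function names `centered_token_similarity` and `layerwise_similarity` are hypothetical and are not taken from the paper's repository. The sketch uses only the standard library to stay dependency-free.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centered_token_similarity(tokens: Sequence[Vector]) -> float:
    """Mean pairwise cosine similarity between tokens after mean-centering.

    ASSUMPTION: this is one plausible reading of "centered token similarity";
    the paper's exact definition may differ.
    `tokens` holds the token embeddings of one image at one layer.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    dim = len(tokens[0])

    # Subtract the mean token so a shared offset does not inflate similarity.
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    centered = [[t[d] - mean[d] for d in range(dim)] for t in tokens]

    # Normalize each centered token; zero vectors stay zero.
    unit: List[List[float]] = []
    for vec in centered:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm if norm > 0 else 0.0 for x in vec])

    # Average cosine similarity over all ordered pairs with i != j.
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += sum(a * b for a, b in zip(unit[i], unit[j]))
    return total / (n * (n - 1))


def layerwise_similarity(hidden_states: Sequence[Sequence[Vector]]) -> List[float]:
    """Apply the metric to each layer's tokens, giving one value per layer."""
    return [centered_token_similarity(layer) for layer in hidden_states]
```

Under this assumed definition, a curve that drops sharply in the first layers, stays low through the middle, and rises again near the end would correspond to the Cliff, Plateau, and Climb phases described in the Figure 1 caption.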