Mechanisms of Non-Monotonic Scaling in Vision Transformers

Anantha Padmanaban Krishna Kumar

TL;DR

This work investigates why deeper Vision Transformers sometimes underperform shallower ones on ImageNet. It identifies a consistent three-phase Cliff-Plateau-Climb pattern in how ViT representations evolve with depth and introduces the Information Scrambling Index ($ISI$) to quantify how information mixes across tokens across depth. The study finds that better representation geometry is associated with progressive marginalization of the [CLS] token as a hub, in favor of distributed consensus among patch tokens. ViT-B reaches better Neural Collapse geometry at a shallower depth than ViT-L, which the study attributes to more controlled information flow. Collectively, the findings argue for depth-aware transformer design: calibrated depth, clean phase transitions in information flow, and diagnostics such as $ISI$, the Attention Consensus Index ($ACI$), and CLS Centrality ($CCC$) to guide architecture choices.

Abstract

Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
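As a rough illustration of what "marginalization of the [CLS] token" could look like in measurements, the sketch below computes, for each layer, the mean attention weight that patch tokens assign to [CLS]. This is a hypothetical proxy chosen for illustration only; it is not the paper's definition of $ISI$, $ACI$, or $CCC$, which are not given in this summary. The function names, the assumption that [CLS] sits at index 0, and the toy attention matrices are all assumptions.

```python
from typing import List, Sequence

# One attention matrix per layer: attn[i][j] is the weight token i places on
# token j, with each row summing to 1. Assumption: [CLS] is token 0.
Matrix = Sequence[Sequence[float]]


def cls_attention_share(attn: Matrix, cls_index: int = 0) -> float:
    """Mean attention weight that non-[CLS] tokens assign to [CLS].

    Hypothetical proxy for how much [CLS] acts as a hub; not the paper's metric.
    """
    if len(attn) < 2:
        raise ValueError("need at least one patch token besides [CLS]")
    patch_rows = [row for i, row in enumerate(attn) if i != cls_index]
    return sum(row[cls_index] for row in patch_rows) / len(patch_rows)


def share_by_layer(layers: Sequence[Matrix]) -> List[float]:
    """Apply the proxy to each layer's (head-averaged) attention matrix."""
    return [cls_attention_share(attn) for attn in layers]


# Toy data (made up): an early layer where patches attend mostly to [CLS],
# and a late layer where they attend mostly to each other.
early = [[0.4, 0.3, 0.3], [0.6, 0.2, 0.2], [0.6, 0.2, 0.2]]
late = [[0.4, 0.3, 0.3], [0.1, 0.45, 0.45], [0.1, 0.45, 0.45]]

shares = share_by_layer([early, late])
print(shares)  # a falling value across layers would indicate [CLS] losing hub status
```

Under this proxy, uniform attention over n tokens gives a share of 1/n, so values well above 1/n suggest [CLS] is acting as a hub and values below it suggest patch tokens are relying mainly on each other.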
