The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
Kanishk Awadhiya
TL;DR
Vision Transformers lack fixed hierarchical inductive biases yet exhibit a data-driven Inductive Bottleneck, a U-shaped entropy profile across layers. The authors introduce Effective Encoding Dimension and Spectral Entropy to quantify layer-wise representational capacity and test ViT-Small models trained with DINO on datasets with varying compositional complexity. They show object-centric data induce deeper middle-layer compression while texture-centric data preserve high rank, with final layers re-expanding to support classification, revealing ViTs as dynamic hierarchies that adapt their internal representations to data statistics. These findings suggest practical avenues for targeted spectral pruning and deeper understanding of generalization in self-supervised Vision Transformers.
Abstract
Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
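To make the layer-wise measurement concrete, the following is a minimal sketch of how a spectral entropy and an effective dimension could be computed from one layer's embeddings. The abstract does not state the exact definition of EED, so the sketch assumes a common entropy-based effective rank: the exponential of the Shannon entropy of the normalized covariance spectrum. The function name `spectral_entropy_and_dimension` and the `features` layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def spectral_entropy_and_dimension(features):
    """Return (spectral entropy, effective dimension) for one layer.

    features: array of shape (n_samples, n_dims) holding embeddings
    (e.g. CLS or pooled patch tokens) from a single layer.
    ASSUMPTION: effective dimension = exp(entropy of the normalized
    covariance eigenvalue spectrum); the paper's EED may differ.
    """
    features = np.asarray(features, dtype=np.float64)
    # Center so the spectrum describes variance rather than the mean offset.
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to covariance eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    power = singular_values ** 2
    total = power.sum()
    if total == 0.0:
        # Degenerate case: all embeddings identical, no variance to spread.
        return 0.0, 1.0
    p = power / total
    p = p[p > 0]  # drop exact zeros to avoid log(0)
    entropy = float(-np.sum(p * np.log(p)))
    return entropy, float(np.exp(entropy))
```

Under this assumed definition, applying the function to each layer's embeddings and plotting the result against depth would give the kind of profile the abstract describes: a value near the embedding width indicates a high-rank representation, while a dip in the middle layers indicates compression.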
