The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers

Kanishk Awadhiya

TL;DR

Vision Transformers lack fixed hierarchical inductive biases yet exhibit a data-driven Inductive Bottleneck, a U-shaped entropy profile across layers. The authors introduce Effective Encoding Dimension and Spectral Entropy to quantify layer-wise representational capacity and test ViT-Small models trained with DINO on datasets with varying compositional complexity. They show object-centric data induce deeper middle-layer compression while texture-centric data preserve high rank, with final layers re-expanding to support classification, revealing ViTs as dynamic hierarchies that adapt their internal representations to data statistics. These findings suggest practical avenues for targeted spectral pruning and deeper understanding of generalization in self-supervised Vision Transformers.

Abstract

Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
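The abstract names the Effective Encoding Dimension but this excerpt does not give its formula. As a rough sketch only, the snippet below assumes one common convention: take the eigenvalue spectrum of the centered feature covariance for a layer, normalize it to a probability distribution, compute its Shannon entropy (the spectral entropy), and exponentiate to get an effective dimension. The function name `spectral_entropy_and_eed` and the choice of squared singular values are illustrative assumptions and may differ from the paper's definition.

```python
import numpy as np


def spectral_entropy_and_eed(features, eps=1e-12):
    """Return (spectral entropy, effective dimension) for one layer.

    ASSUMPTION: the effective dimension is exp(Shannon entropy) of the
    normalized covariance spectrum. The paper's exact EED may differ.

    features: array of shape (n_samples, dim), e.g. token or CLS
    embeddings collected from a single transformer layer.
    """
    x = np.asarray(features, dtype=np.float64)
    # Center each feature so the spectrum reflects variance, not the mean.
    x = x - x.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to covariance eigenvalues.
    power = np.linalg.svd(x, compute_uv=False) ** 2
    total = power.sum()
    if total <= eps:
        # Degenerate case: all samples identical, no variance to spread.
        return 0.0, 1.0
    p = power / total
    p = p[p > eps]  # drop numerically zero directions before the log
    entropy = float(-(p * np.log(p)).sum())
    return entropy, float(np.exp(entropy))
```

Under this assumed definition, features whose variance is spread evenly over `d` directions give an effective dimension near `d`, while features confined to a single direction give a value near 1. Applying the function to each layer's activations in turn would yield a layer-wise profile of the kind the abstract describes.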

Paper Structure

This paper contains 21 sections, 7 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Layer-wise Effective Encoding Dimension (EED) for ViT-Small on Tiny ImageNet. The profile clearly shows the "Inductive Bottleneck" structure: high rank at input, compression in the middle ($L_2$), and expansion at output ($L_{11}$).
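The structure described in the caption (high at the input, lowest in a middle layer, higher again at the output) can be summarized from any list of per-layer EED values. The helper below is a hypothetical sketch, not the authors' procedure: `bottleneck_summary` and its ratio-based measures are assumptions introduced here for illustration, and the numbers in the usage lines are synthetic placeholders rather than results from the paper.

```python
def bottleneck_summary(eed_per_layer):
    """Summarize a layer-wise profile of positive EED values.

    Hypothetical helper: locates the minimum and reports how strongly the
    profile compresses from the first layer and re-expands by the last.
    """
    values = [float(v) for v in eed_per_layer]
    if len(values) < 3:
        raise ValueError("need at least three layers to look for a bottleneck")
    if min(values) <= 0:
        raise ValueError("EED values are expected to be positive")
    # Index of the layer with the lowest effective dimension.
    idx = min(range(len(values)), key=values.__getitem__)
    return {
        "bottleneck_layer": idx,
        # How many times smaller the minimum is than the first layer.
        "compression_ratio": values[0] / values[idx],
        # How many times larger the last layer is than the minimum.
        "reexpansion_ratio": values[-1] / values[idx],
        # Heuristic: the minimum lies strictly inside the network.
        "is_u_shaped": 0 < idx < len(values) - 1,
    }


# Synthetic 12-layer profile, invented purely to exercise the function.
synthetic_profile = [40, 24, 10, 12, 14, 15, 16, 18, 20, 22, 25, 30]
summary = bottleneck_summary(synthetic_profile)
```

Ratios are used here instead of absolute differences so that profiles from models with different embedding widths could be compared on the same scale; that choice is a design assumption of this sketch.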