Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization
Liyu Zerihun
TL;DR
The paper introduces the concept of an effective-rank region to quantify the intrinsic dimensionality of deep networks, treating rank as a quantity to be measured rather than a training objective. It combines low-rank factorization of a full-rank teacher with geometric distillation to map accuracy as a function of rank for ViT-B/32 fine-tuned on CIFAR-100, yielding an intrinsic dimensionality profile. The authors identify an effective-rank band of roughly 16 to 34 with a knee near 31, achieving substantial parameter compression (≈11×) while retaining most of the teacher's accuracy. The framework offers a practical tool for characterizing and comparing the intrinsic dimensionality of architectures and datasets, with implications for deployment and cross-model analysis.
Abstract
Deep networks are heavily over-parameterized, yet their learned representations often admit low-rank structure. We introduce a framework for estimating a model's intrinsic dimensionality by treating learned representations as projections onto a low-rank subspace of the model's full capacity. Our approach is to train a full-rank teacher, factorize its weights at multiple ranks, and train each factorized student via distillation to measure performance as a function of rank. We define effective rank as a region, not a point: the smallest contiguous set of ranks for which the student reaches 85-95% of teacher accuracy. To stabilize estimates, we fit accuracy vs. rank with a monotone PCHIP interpolant and identify where the teacher-normalized curve crosses these thresholds. We also define the effective knee as the rank maximizing the perpendicular distance between the smoothed accuracy curve and its endpoint secant, an intrinsic indicator of where marginal gains concentrate. On ViT-B/32 fine-tuned on CIFAR-100 (one seed, due to compute constraints), factorizing linear blocks and training with distillation yields an effective-rank region of approximately [16, 34] and an effective knee at r* ≈ 31. At rank 32, the student attains 69.46% top-1 accuracy vs. 73.35% for the teacher (≈94.7% of the teacher's accuracy) while achieving substantial parameter compression. We provide a framework to estimate effective-rank regions and knees across architectures and datasets, offering a practical tool for characterizing the intrinsic dimensionality of deep models.
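The region and knee definitions above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several choices are assumptions: the region is read as the interval between the first crossings of the 85% and 95% levels on the teacher-normalized curve, both axes are rescaled to [0, 1] before measuring distance to the secant, and the interpolant is a hand-rolled PCHIP-style cubic with simplified one-sided endpoint slopes (a library routine such as `scipy.interpolate.PchipInterpolator` would be the usual choice). The function names and the accuracy values are hypothetical toy data, not the paper's measurements.

```python
from bisect import bisect_right

def pchip_slopes(x, y):
    """Node slopes for a monotone cubic Hermite interpolant (PCHIP-style)."""
    h = [b - a for a, b in zip(x, x[1:])]
    d = [(yb - ya) / hi for ya, yb, hi in zip(y, y[1:], h)]
    # Assumption: endpoint slopes are the adjacent secants (a simplification).
    m = [d[0]] + [0.0] * (len(x) - 2) + [d[-1]]
    for i in range(1, len(x) - 1):
        if d[i - 1] * d[i] > 0:  # otherwise keep slope 0 at a local extremum
            w1, w2 = 2 * h[i] + h[i - 1], h[i] + 2 * h[i - 1]
            m[i] = (w1 + w2) / (w1 / d[i - 1] + w2 / d[i])  # weighted harmonic mean
    return m

def pchip_eval(x, y, m, xq):
    """Evaluate the cubic Hermite interpolant at xq (x must be increasing)."""
    i = min(max(bisect_right(x, xq) - 1, 0), len(x) - 2)
    h = x[i + 1] - x[i]
    t = (xq - x[i]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y[i] + h10 * h * m[i] + h01 * y[i + 1] + h11 * h * m[i + 1]

def first_crossing(grid, values, level):
    """First grid rank whose value reaches `level`, or None if never reached."""
    for r, v in zip(grid, values):
        if v >= level:
            return r
    return None

def effective_rank_profile(ranks, accs, teacher_acc, lo=0.85, hi=0.95, step=0.25):
    """Return ((rank at lo crossing, rank at hi crossing), knee rank)."""
    m = pchip_slopes(ranks, accs)
    n = int(round((ranks[-1] - ranks[0]) / step))
    grid = [ranks[0] + k * step for k in range(n + 1)]
    # Teacher-normalized smoothed accuracy curve on a fine rank grid.
    curve = [pchip_eval(ranks, accs, m, r) / teacher_acc for r in grid]
    region = (first_crossing(grid, curve, lo), first_crossing(grid, curve, hi))
    # Knee: after rescaling both axes to [0, 1], the secant is the diagonal,
    # so perpendicular distance is proportional to |v - u|.
    u = [(r - grid[0]) / (grid[-1] - grid[0]) for r in grid]
    v = [(c - curve[0]) / (curve[-1] - curve[0]) for c in curve]
    knee = max(zip(grid, u, v), key=lambda p: abs(p[2] - p[1]))[0]
    return region, knee

# Hypothetical toy sweep (NOT the paper's data): student accuracy per rank.
ranks = [4, 8, 16, 24, 32, 48, 64]
accs = [30.0, 50.0, 66.0, 72.0, 75.0, 77.0, 78.0]
teacher_acc = 80.0

region, knee = effective_rank_profile(ranks, accs, teacher_acc)
print("effective-rank region:", region, "knee:", knee)
```

Because the knee depends on how the axes are scaled, a different normalization (for example, raw rank against raw accuracy) can shift the estimate; the rescaling used here is one reasonable convention rather than the paper's stated choice.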
