Fewer Tokens, Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning

Shawn Young, Xingyu Zeng, Lijian Xu

TL;DR

This work investigates how semantic redundancy in visual tokens constrains scaling of vision transformers and proposes an MDL-based framework to learn a compact, orthogonal basis set that spans image semantics. The core contribution is the Orthogonal Filtering module, consisting of an allocator and a slot-based basis representation, guided by an orthogonality loss to achieve low-rank, semantically disentangled reconstructions. A key empirical finding is the Law of Parametric Efficiency Priority, which shows that larger models require markedly fewer image bases to reach the semantic ceiling, enabling substantial token-efficiency gains and reduced compute. The authors also introduce PaperScope, a large visual long-context dataset of 17,365 high-resolution papers to facilitate research on long-range visual understanding and token budgeting at scale.

Abstract

This paper investigates the fundamental relationship between model capacity and the minimal number of visual tokens required to preserve image semantics. Inspired by the Minimum Description Length principle, we reinterpret image tokens as vectors in a visual semantic space and define the intrinsic semantic complexity of an image as the smallest set of basis vectors needed to span this space. Building on this perspective, we propose Orthogonal Filtering, a lightweight module that adaptively clusters redundant tokens into a compact set of orthogonal bases. Through extensive experiments across a range of ViT models, we reveal a consistent token-model scaling law: larger models require significantly fewer tokens to span the visual semantic space. We also contribute a visual long-context dataset.
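
One way to make the "smallest set of basis vectors" idea concrete is sketched below. It rests on assumptions the abstract does not state: the $N$ tokens are stacked as rows of $X\in\mathbb{R}^{N\times D}$, an assignment matrix $A\in\mathbb{R}^{N\times K}$ and a basis matrix $B\in\mathbb{R}^{K\times D}$ reconstruct them, and $\hat{B}$ denotes $B$ with unit-norm rows. The symbols $X$, $N$, $D$, $K$, the tolerance $\varepsilon$, and the Frobenius-norm penalty are illustrative choices; the paper's own definitions may differ.

$$
K^{*}(X) \;=\; \min\bigl\{\,K \;:\; \exists\, A\in\mathbb{R}^{N\times K},\ B\in\mathbb{R}^{K\times D}\ \text{with}\ \lVert X - AB\rVert_{F} \le \varepsilon \,\bigr\},
\qquad
\mathcal{L}_{\mathrm{orth}} \;=\; \bigl\lVert \hat{B}\hat{B}^{\top} - I_{K} \bigr\rVert_{F}^{2}
$$

Under this reading, $K^{*}(X)$ is an $\varepsilon$-rank of the token matrix, and $\mathcal{L}_{\mathrm{orth}}$ is zero exactly when the rows of $\hat{B}$ are orthonormal, so minimizing it pushes the learned bases toward a non-redundant spanning set.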

Paper Structure

This paper contains 19 sections, 1 theorem, 8 equations, 2 figures, 10 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $\mathcal{H}$ be a hypothesis class of visual representations induced by the factorization module, and let $h = (A,B)\in \mathcal{H}$ denote a specific instantiation characterized by the assignment matrix $A$ and the basis matrix $B$. Let $d: \mathcal{H}\rightarrow \{0,1\}^{*}$ represent a prefix-f… where $S\sim \mathcal{D}^{m}$ has probability at least $1-\delta$ over the sampled training set. Th…

Figures (2)

  • Figure 1: Larger models reach their performance upper bound with fewer tokens. (a) Reconstruction results across different visible ratios and model capacities. PSNR reflects the performance at each visible ratio for a fixed model size, where red indicates improvement compared to the previous ratio and green denotes stability. (b) Tokens required for models of different capacities to reach the upper performance bound. An empirical performance upper bound is delineated with a grey line.
  • Figure 2: (a) A simple and lightweight orthogonal filter module precedes the visual backbone to construct orthogonal bases for visual representation. Each image token is allocated to a unique slot; tokens in the same slot are combined by weighted fusion, while empty slots are filled with random noise. The allocator is responsible for extracting the visual bases guided by the orthogonality loss, while the slots merely normalize and fuse the assigned image tokens without performing additional feature extraction. (b) Structural differences from MoE: Unlike MoE, where tokens are distributed across multiple experts, our allocator groups semantically similar tokens into slots that collectively form the orthogonal bases. Consequently, MoE may contain inactive experts, whereas our design fills empty slots with random noise. In addition, the number of output tokens corresponds to the number of slots rather than the number of input tokens, and our slots omit the FFN layers used by MoE experts for feature extraction.
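
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned allocator is replaced by fixed, hypothetical `prototypes` scored with cosine similarity, the fusion weights are taken as `exp(score)`, and the orthogonality loss is taken to be the mean squared off-diagonal entry of the Gram matrix of the unit-norm bases. All function and variable names are invented for this example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_filter(tokens, prototypes, rng):
    """Assign each token to exactly one slot, fuse per slot, fill empty slots with noise.

    `prototypes` stands in for the learned allocator (an assumption of this sketch).
    Returns one unit-norm basis per slot plus the number of tokens in each slot.
    """
    dim = len(tokens[0])
    protos = [normalize(p) for p in prototypes]
    slots = [[] for _ in protos]
    for tok in tokens:
        unit = normalize(tok)
        scores = [dot(unit, p) for p in protos]
        k = max(range(len(protos)), key=scores.__getitem__)  # unique slot per token
        slots[k].append((math.exp(scores[k]), tok))           # assumed fusion weight
    bases = []
    for members in slots:
        if not members:
            # Empty slot: substitute random noise, as the caption describes.
            bases.append(normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]))
            continue
        total = sum(w for w, _ in members)
        fused = [sum(w * tok[d] for w, tok in members) / total for d in range(dim)]
        bases.append(normalize(fused))  # slots only normalize and fuse
    return bases, [len(m) for m in slots]


def orthogonality_loss(bases):
    """Mean squared off-diagonal Gram entry; zero when the bases are orthonormal."""
    k = len(bases)
    off = [dot(bases[i], bases[j]) ** 2 for i in range(k) for j in range(k) if i != j]
    return sum(off) / len(off) if off else 0.0


rng = random.Random(0)
tokens = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical
bases, counts = orthogonal_filter(tokens, prototypes, rng)
loss = orthogonality_loss(bases)
print("tokens per slot:", counts)
print("orthogonality loss:", loss)
```

Four input tokens go in and three bases come out, one per slot, which mirrors the caption's point that the output length is set by the slot count rather than the input length. In a trained module the allocator would be optimized so that the loss above (or whatever form the paper actually uses) is driven toward zero.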

Theorems & Definitions (2)

  • Theorem 2.1
  • Definition 2.1