TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele

TL;DR

TokenFormer tackles the heavy cost of scaling Transformer models by reframing all linear projections as attention-based interactions between input tokens and parameter tokens, using a Pattention layer that treats model parameters as tokens. By appending new parameter tokens rather than changing channel dimensions, the model scales from 124M to 1.4B parameters in a progressive, reuse-friendly manner, achieving performance close to (or better than) models trained from scratch while reducing cumulative training costs. The approach is validated on language and vision benchmarks, showing competitive perplexities and zero-shot results and comparable ImageNet performance, with ablations demonstrating the benefits of GeLU with $L_2$ normalization and of zero initialization for scalable growth. The work suggests a broader paradigm of tokenizing everything and leveraging attention for scalable, interpretable, and potentially MoE-inspired architectures, with practical implications for efficient foundation-model development.

Abstract

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
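
The token-parameter attention described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`pattention`, `grow`), the fixed scale `tau`, and the small `eps` guard are assumptions made for illustration, and the GeLU with $L_2$ normalization follows only the brief description in the TL;DR. Input tokens act as queries, parameter tokens act as keys and values, and the model grows by appending zero-initialized key-value pairs.

```python
import math
import random


def gelu(x):
    # Exact GeLU via the Gaussian error function; gelu(0.0) == 0.0.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def pattention(x, key_params, value_params, tau=1.0, eps=1e-12):
    """Sketch of token-parameter attention.

    x:            T input tokens, each of length d_in    (queries)
    key_params:   n parameter tokens, each of length d_in  (keys)
    value_params: n parameter tokens, each of length d_out (values)
    tau, eps:     illustrative assumptions, not taken from the source
    """
    d_out = len(value_params[0])
    out = []
    for token in x:
        # Similarity between this input token and every key parameter token.
        scores = [sum(q * k for q, k in zip(token, key)) for key in key_params]
        # L2-normalize the scores across parameter tokens, then apply GeLU
        # (used here in place of the usual softmax).
        norm = math.sqrt(sum(s * s for s in scores)) + eps
        weights = [gelu(tau * s / norm) for s in scores]
        # Weighted sum of the value parameter tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, value_params))
                    for j in range(d_out)])
    return out


def grow(key_params, value_params, n_new):
    # Append n_new zero-initialized key-value parameter pairs.
    # The token (channel) dimensions d_in and d_out stay unchanged.
    d_in, d_out = len(key_params[0]), len(value_params[0])
    new_keys = key_params + [[0.0] * d_in for _ in range(n_new)]
    new_values = value_params + [[0.0] * d_out for _ in range(n_new)]
    return new_keys, new_values


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(3, 4)        # 3 input tokens, d_in = 4
keys = rand_matrix(5, 4)     # 5 key parameter tokens
values = rand_matrix(5, 6)   # 5 value parameter tokens, d_out = 6

before = pattention(x, keys, values)
big_keys, big_values = grow(keys, values, n_new=7)
after = pattention(x, big_keys, big_values)

# Largest change in the layer output caused by growing the parameter set.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(before, after)
               for a, b in zip(row_a, row_b))
```

In this sketch a zero key yields a zero score, which leaves the $L_2$ norm unchanged and maps to a zero weight under GeLU, so the grown layer starts from the same output as the smaller one. That property depends on the fixed `tau` assumed here; a scale that varies with the number of parameter tokens would not preserve the output exactly.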

Paper Structure

This paper contains 23 sections, 25 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Traditionally, large transformer architectures are trained from scratch without reusing previous smaller-scale models (represented by blue dots on the left). In this paper, we propose a novel fully attention-based architecture that allows the model to be scaled incrementally, thus greatly reducing the overall cost of training large transformer architectures (depicted by red dots on the left). The right panel compares the conventional Transformer with our Tokenformer.
  • Figure 2: Tokenformer is a fully attention-driven architecture featuring a new token-Parameter attention (Pattention) layer. The Pattention uses a set of learnable tokens to represent model parameters and lets the input tokens attend to them. As the model scales, Tokenformer adds new learnable tokens to expand the existing key-value parameter sets, while keeping the feature dimension constant and leaving the rest of the computation unaffected.
  • Figure 3: Evaluating model scaling costs through cumulative computational budgets. The Transformer baseline incurs expenses for each individual scaling step performed independently from scratch, whereas Tokenformer aggregates costs across all scaling stages, including training a 124M model initially, progressively scaling to 354M, 757M, and 1.4B parameters.
  • Figure 4: Evaluating model scaling costs by measuring the budget required at each scaling stage. The Transformer baselines used are consistent with those depicted in Figure 3, trained with 30B and 300B tokens. Similarly, for Tokenformer, the cost is the budget required for each incremental scaling step from a smaller one. All the experiments were conducted on TPU v4 hardware.
  • Figure 5: The relationship between FLOPs and text length for both Transformer and Tokenformer. As shown in the paper's table of FLOPs and parameter counts, the Transformer's computational cost for token-token interactions increases as $d_{\text{model}}$ grows. Our Tokenformer model, however, offers a flexible parameter scaling mechanism that keeps $d_{\text{token}}$ constant. This strategy results in controllable computational costs for token-token interactions and markedly enhances the efficiency of long-text modeling.
  • ...and 2 more figures
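
The cost argument in the Figure 5 caption can be made concrete with a rough estimate. The sketch below uses the standard approximation for one self-attention layer, counting a multiply-add as 2 FLOPs and ignoring the softmax; it is not the paper's FLOPs accounting. The function name and the widths 768 and 2048 are hypothetical values chosen for illustration.

```python
def token_token_flops(seq_len, width):
    # Q K^T costs about 2 * T * T * d FLOPs, and the weighted sum over the
    # values costs another 2 * T * T * d, so roughly 4 * T^2 * d per layer.
    # Splitting the width across heads does not change this total.
    return 4 * seq_len * seq_len * width


fixed_width = 768     # hypothetical constant token dimension
scaled_width = 2048   # hypothetical width of a larger model scaled by channels

for seq_len in (1024, 8192, 32768):
    fixed = token_token_flops(seq_len, fixed_width)
    scaled = token_token_flops(seq_len, scaled_width)
    print(f"T={seq_len:>6}  fixed={fixed:.3e}  scaled={scaled:.3e}  "
          f"ratio={scaled / fixed:.2f}")
```

Under this approximation the token-token cost is quadratic in text length and linear in width, so holding the token dimension constant while adding parameter tokens keeps the quadratic term from growing with model size.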