Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan

TL;DR

The paper addresses how to improve large language models under a fixed parameter budget. It proposes the Zero Token Transformer (ZTT), which combines head-tail decoupled cyclic parameter sharing, a learnable Zero Token mechanism, and a dynamic, attention-driven early-exit strategy. ZTT excludes the first and last layers from cycling and reuses only the intermediate layers, and it adds a Zero Token to each attention block to guide cycle-specific updates; together these enable adaptive computation and reduce redundant processing. The reported results show ZTT outperforming traditional cycling and baseline approaches in both training-from-scratch and fine-tuning settings across multiple NLP benchmarks, with early exits driven by Zero Token attention. For resource-constrained deployments, the method offers deeper reasoning under a tight parameter budget and can be applied to large pre-trained models.

Abstract

Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer's computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.
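
To make the head-tail decoupling concrete, the following is a minimal sketch of the order in which layers would be visited when the first and last layers run once and only the intermediate layers are cycled. The function name `head_tail_decoupled_schedule` and the zero-based layer indexing are assumptions made for illustration; they are not taken from the paper.

```python
def head_tail_decoupled_schedule(num_layers, num_cycles):
    """Return the layer indices visited in one forward pass.

    Assumption (hypothetical helper): layer 0 is the head, the last layer is
    the tail, and every layer in between is reused once per cycle.
    """
    if num_layers < 3:
        raise ValueError("need a head, a tail, and at least one middle layer")
    if num_cycles < 1:
        raise ValueError("num_cycles must be at least 1")
    head = [0]
    middle = list(range(1, num_layers - 1))
    tail = [num_layers - 1]
    # The head and tail run exactly once; the middle block repeats.
    return head + middle * num_cycles + tail


schedule = head_tail_decoupled_schedule(num_layers=6, num_cycles=2)
print(schedule)
```

With six layers and two cycles, the schedule visits ten layer applications while storing parameters for only six layers, which is the sense in which cycling adds depth without adding parameters.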

Paper Structure

This paper contains 37 sections, 4 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Left: A 6-layer vanilla transformer without cyclic processing. Center: A transformer with a simple two-cycle mechanism. Right: A two-cycle Zero Token Transformer, where the first and last layers do not participate in the cycling process. Each layer introduces an additional Zero Token. The rightmost part illustrates how the Zero Token is incorporated. Using the second layer as an example: the Zero Token is prepended to the sequence by aligning its key with the original tokens at the beginning, and an all-zero value is added in front of the value sequence. Placing the Zero Token at the beginning ensures that all subsequent tokens can effectively attend to it.
  • Figure 2: Comparison of model performance under equal computational complexity. (a) The effect of varying computational complexity, where 1L denotes the original model with a single layer, and increased complexity corresponds to repeated model calls. "Early exit" refers to adding a classification head after each cycle to train intermediate results. (b) On the left y-axis, the intermediate results of different models under the "early exit" condition when the total computational complexity is fixed at 15 (cycles $\times$ layers). On the right y-axis, the average attention values of other tokens to the Zero Token and the gate value at the output of the Zero Token Transformer.
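
The Zero Token construction described in the Figure 1 caption (a key placed at the front of the key sequence, paired with an all-zero value) can be sketched for a single query as follows. This is an illustrative sketch in pure Python, not the authors' implementation: the function name `attend_with_zero_token`, the single-head and single-query setup, and the omission of causal masking are all assumptions.

```python
import math


def attend_with_zero_token(query, keys, values, zero_key):
    """Scaled dot-product attention for one query with a prepended zero token.

    Assumption (hypothetical helper): the zero token contributes a key
    (trainable in a real model) and an all-zero value, so attention mass
    assigned to it scales the output down instead of mixing in new content.
    """
    dim = len(query)
    value_dim = len(values[0])
    all_keys = [zero_key] + keys                 # zero token sits at position 0
    all_values = [[0.0] * value_dim] + values    # its value is all zeros
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in all_keys
    ]
    peak = max(scores)                           # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [
        sum(w * v[j] for w, v in zip(weights, all_values))
        for j in range(value_dim)
    ]
    # weights[0] is the attention paid to the zero token.
    return output, weights[0]


out, zero_weight = attend_with_zero_token(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
    zero_key=[0.0, 0.0],
)
print(out, zero_weight)
```

The second return value is the attention weight on the zero token, the kind of quantity Figure 2(b) plots as an average over tokens. How that signal is turned into an exit decision (for example, through the gate value) is not specified in this excerpt, so no exit rule is sketched here.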