Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement
Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan
TL;DR
The paper addresses how to improve large language models under a fixed parameter budget. It proposes the Zero Token Transformer (ZTT), a framework that combines head-tail decoupled cyclic parameter sharing, a learnable Zero Token mechanism, and a dynamic, attention-driven early-exit strategy. ZTT excludes the first (head) and last (tail) layers from cycling and iterates only over the intermediate layers. At each cycle, a Zero Token retrieved from a Zero-Token Pool joins the regular tokens in the attention computation to guide cycle-specific updates, which enables adaptive computation and reduces redundant processing. Empirical results show that ZTT outperforms traditional cycling and baseline approaches in both training-from-scratch and fine-tuning settings across multiple NLP benchmarks, and that the attention paid to the Zero Token provides an effective early-exit signal. The method is suited to resource-constrained deployments: it gains effective depth without adding parameters and can be applied to large pre-trained models through fine-tuning.
Abstract
Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer's computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.
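The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the zero token contributes a trainable key paired with an all-zero value vector; attention is single-head and non-causal; each cycle adds its output residually; and the model exits when the mean attention mass on the zero token exceeds a hypothetical `exit_threshold`. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_zero_token(q, k, v, zero_key):
    """Single-head attention with one extra slot for the zero token.

    q, k, v: arrays of shape (seq, d); zero_key: array of shape (d,).
    Returns (output, zero_attn), where zero_attn has shape (seq,) and holds
    the attention probability each query assigns to the zero token.
    """
    d = q.shape[-1]
    keys = np.vstack([zero_key[None, :], k])             # (seq + 1, d)
    # Assumption: the zero token's value is all zeros, so attention placed
    # on it shrinks the update produced by this cycle.
    values = np.vstack([np.zeros((1, v.shape[-1])), v])  # (seq + 1, d)
    probs = softmax(q @ keys.T / np.sqrt(d), axis=-1)    # (seq, seq + 1)
    return probs @ values, probs[:, 0]

def cyclic_forward(x, head, shared_block, tail, zero_key_pool, exit_threshold=0.9):
    """Run head once, cycle the shared block, then run tail once.

    zero_key_pool holds one zero-token key per cycle, shape (max_cycles, d).
    Returns (output, cycles_run).
    """
    h = head(x)
    cycles_run = 0
    for zero_key in zero_key_pool:        # one zero token per cycle
        update, zero_attn = shared_block(h, zero_key)
        h = h + update                    # residual refinement (assumption)
        cycles_run += 1
        # Hypothetical exit rule: stop once most attention goes to the zero token.
        if zero_attn.mean() > exit_threshold:
            break
    return tail(h), cycles_run

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, seq, max_cycles = 8, 4, 3
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

    def shared_block(h, zero_key):
        # The same projection weights are reused in every cycle.
        return attention_with_zero_token(h @ Wq, h @ Wk, h @ Wv, zero_key)

    identity = lambda t: t  # stand-ins for the non-shared head and tail layers
    pool = rng.normal(size=(max_cycles, d))
    out, n = cyclic_forward(rng.normal(size=(seq, d)), identity, shared_block, identity, pool)
    print(out.shape, n)
```

In this sketch the head and tail run exactly once while the shared block is reused up to `len(zero_key_pool)` times, so effective depth grows without adding parameters beyond one key vector per cycle.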
