Universal YOCO for Efficient Depth Scaling

Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei

Abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that grows with model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive on general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
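To make the dataflow concrete, here is a minimal sketch of the forward pass the abstract describes: a parameter-shared self-decoder block applied for $T$ recursive iterations with window-restricted attention, a global KV representation produced once from its output, and a cross-decoder whose layers all attend to that shared cache. All module names, layer counts, and dimensions are illustrative assumptions, not the authors' implementation, and a dense banded mask stands in for the paper's efficient local attention.

```python
import torch
import torch.nn as nn


def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean causal mask restricted to a local window (True = blocked)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j > i) | (i - j >= window)


class Block(nn.Module):
    """One pre-norm attention + MLP block, usable for self- or cross-attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, kv=None, attn_mask=None):
        kv = x if kv is None else kv
        h = self.norm1(x)
        a, _ = self.attn(h, kv, kv, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.norm2(x))


class UniversalYOCO(nn.Module):
    """Universal Self-Decoder (shared weights, T iterations, windowed attention)
    followed by a Cross-Decoder that reuses a single global KV cache."""

    def __init__(self, d_model=256, n_heads=4, self_layers=2, cross_layers=2,
                 window=64, T=3):
        super().__init__()
        self.T, self.window = T, window
        # The same shallow block is reused for every recursive iteration.
        self.self_block = nn.ModuleList([Block(d_model, n_heads) for _ in range(self_layers)])
        self.cross_block = nn.ModuleList([Block(d_model, n_heads) for _ in range(cross_layers)])
        # Global K/V representation, produced once from the self-decoder output.
        self.to_global_kv = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        seq = x.size(1)
        local_mask = sliding_window_mask(seq, self.window).to(x.device)
        # Universal Self-Decoder: T parameter-shared iterations over shallow,
        # window-restricted attention layers.
        for _ in range(self.T):
            for layer in self.self_block:
                x = layer(x, attn_mask=local_mask)
        # The global KV cache is generated once and is independent of T.
        global_kv = self.to_global_kv(x)
        # Cross-Decoder: every layer attends to the shared global KV cache
        # under a causal mask for autoregressive prediction.
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1)
        h = x
        for layer in self.cross_block:
            h = layer(h, kv=global_kv, attn_mask=causal)
        return h


# Quick shape check.
out = UniversalYOCO()(torch.randn(2, 128, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```

In an actual decoding setting the self-decoder layers would keep only window-sized incremental caches per iteration, while the single global KV tensor serves every cross-decoder layer; that sharing is what keeps the global cache constant in both $T$ and depth.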

Paper Structure

This paper contains 40 sections, 6 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Overview of the YOCO-U architecture. Universal Self-Decoder (bottom) performs recursive computation ($T$ iterations, indicated by the red dashed line) using efficient self-attention (local windows) to refine representations. Crucially, the global KV cache produced for cross-attention is generated once and remains constant regardless of $T$, while only the local window-based KV caches grow with iterations, keeping overall cache overhead negligible. Cross-Decoder (top) reuses this shared global KV cache via cross-attention for autoregressive token prediction.
  • Figure 2: Scaling behavior of language modeling loss with the same model size. (Left) Loss versus training FLOPs: YOCO-U achieves competitive or lower loss ($\Delta L{=}0.033$) at the same FLOPs budget, while incurring negligible KV cache overhead. (Right) Loss versus training tokens: YOCO-U also improves data efficiency, requiring approximately 62% fewer tokens to reach comparable performance.
  • Figure 3: Accuracy comparison on 11 math benchmarks. YOCO-U consistently outperforms the YOCO baseline across all tasks, achieving a significant boost in average accuracy.
  • Figure 4: Long-sequence perplexity decreases as the input length increases on Book (top) and Code (bottom) data. YOCO-U consistently achieves lower perplexity than the non-recursive baselines, i.e., Transformer and YOCO. YOCO-U also maintains parity with the heavier recursive baseline (RINS), indicating effective utilization of long-range context.
  • Figure 5: Parameter scaling properties. We keep the training steps the same. Left: YOCO-U achieves comparable performance with 50% fewer parameters than YOCO. Right: YOCO-U demonstrates scalable parameter utility as the activated parameter count increases.
  • ...and 3 more figures
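The cache behavior described in the Figure 1 caption can be made concrete with some back-of-the-envelope accounting. The sketch below, using purely illustrative layer counts, window size, model width, and element size (none taken from the paper, and ignoring details such as grouped-query heads), compares the decoding-cache footprint of a YOCO-U-style model against a conventional looping Transformer that replicates a full-length KV cache in every layer of every iteration.

```python
def kv_cache_bytes(seq_len, d_model, n_self_layers, n_cross_layers,
                   window, T, bytes_per_elem=2):
    """Rough KV-cache accounting; all sizes are illustrative assumptions."""
    # Global cache: one K and one V of full length, produced once and shared
    # by all cross-decoder layers -- independent of T and of cross-depth.
    global_cache = 2 * seq_len * d_model * bytes_per_elem
    # Local caches: each self-decoder layer keeps a window-sized K/V, and each
    # of the T recursive iterations keeps its own window cache.
    local_cache = 2 * min(window, seq_len) * d_model * n_self_layers * T * bytes_per_elem
    # Conventional looping Transformer: full-length K/V in every layer of
    # every iteration.
    looped_dense = 2 * seq_len * d_model * (n_self_layers * T + n_cross_layers) * bytes_per_elem
    return global_cache, local_cache, looped_dense


for T in (1, 2, 4):
    g, l, d = kv_cache_bytes(seq_len=32_768, d_model=4_096, n_self_layers=16,
                             n_cross_layers=16, window=1_024, T=T)
    print(f"T={T}: global {g / 2**20:,.0f} MiB (constant), "
          f"local {l / 2**20:,.0f} MiB, looped dense {d / 2**20:,.0f} MiB")
```

Under these assumptions only the window-bounded local term grows with $T$; the full-length global cache stays fixed, which is the property the Figure 1 caption and the abstract emphasize when arguing that recursion adds depth at negligible cache cost.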