DND: Boosting Large Language Models with Dynamic Nested Depth

Tieyuan Chen, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Weiyao Lin, Jianguo Li

TL;DR

Dynamic Nested Depth (DND) introduces token-wise adaptive deepening: it selects critical tokens for an additional processing pass within the mid-layers of pretrained LLMs. A token-choice router, threshold control, and a gated fusion mechanism allow efficient post-training integration into dense and MoE architectures, yielding measurable accuracy gains with modest FLOP overhead. Token selection is stabilized by a router controlling loss (score dispersion and distribution preservation) and a threshold control scheme (buffer proportional control with EMA synchronization). Empirically, DND improves results across General Knowledge, Math/STEM, and Coding benchmarks on Qwen3-1.7B and Qwen3-30B-A3B, with a minimal parameter increase and limited speed impact, which suggests it can enhance LLM performance without full-scale retraining.

Abstract

We introduce Dynamic Nested Depth (DND), a novel method that improves the performance of off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of a given transformer layer, DND identifies the more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
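
The nested pass described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-token `layer` callable stands in for a real transformer layer (which would also attend across tokens), the router is assumed to be a single linear projection followed by a sigmoid, and the fusion rule (a convex combination weighted by the router score) is a hypothetical choice, since the exact formula is not given in this section. The names `dnd_block`, `router_w`, and `threshold` are illustrative only.

```python
import math
from typing import Callable, List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dnd_block(
    hidden: List[Vector],
    layer: Callable[[Vector], Vector],
    router_w: Vector,
    threshold: float,
) -> List[Vector]:
    """Vanilla pass for every token, plus a nested pass for routed tokens."""
    outputs: List[Vector] = []
    for h in hidden:
        # Every token receives the ordinary forward pass.
        vanilla = layer(h)
        # Token-choice routing: the score depends on this token alone,
        # so the decision does not need to see the rest of the sequence.
        score = sigmoid(sum(w * x for w, x in zip(router_w, vanilla)))
        if score > threshold:
            # Selected tokens are fed back through the same layer once more.
            nested = layer(vanilla)
            # Assumed fusion rule: blend nested and vanilla outputs by score.
            fused = [score * n + (1.0 - score) * v
                     for n, v in zip(nested, vanilla)]
            outputs.append(fused)
        else:
            # Unselected tokens skip the extra computation entirely.
            outputs.append(vanilla)
    return outputs


if __name__ == "__main__":
    double = lambda h: [2.0 * x for x in h]  # toy stand-in for a layer
    tokens = [[1.0, 0.0], [-1.0, 0.0]]
    print(dnd_block(tokens, double, router_w=[1.0, 0.0], threshold=0.5))
```

In this toy run only the first token scores above the threshold, so only it pays for the second pass; the second token keeps its vanilla output unchanged.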

Paper Structure

This paper contains 33 sections, 11 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: DND Motivation. The tokens highlighted in red denote critical elements in the QA pair. We propose a strategy within the transformer layers to identify and allocate additional computation to these critical tokens. $\mathbf{L}_s$ and $\mathbf{L}_e$ indicate the start and end layers that adopt this strategy.
  • Figure 2: DND Framework. The central idea of DND is a dynamic nested pass of critical tokens after the vanilla forward process of the transformer layers. Whether a token is selected or not is determined by a router. The block's final output is a merged result of vanilla output and nested output, governed by normalized routing weights.
  • Figure 3: Routing Design and Training Strategies. Figure (a) illustrates expert-choice routing, where the top-k proportion is selected over the entire sequence. Figure (b) shows token-choice routing, which selects tokens independently and suits auto-regressive models. Figure (c) summarizes our training strategy: routing outputs are optimized to enhance token distinguishability by dispersing the token-level routing distribution via $\mathcal{L}_{\text{sd}}$ and preventing it from collapsing into gradient-vanishing regions via $\mathcal{L}_{\text{dp}}$. In addition, buffer proportional control (Eq. (9)) and EMA synchronization (Eq. (10)) effectively regulate the stability of the selection by computing the real-time error ratio.
  • Figure 4: Threshold Adjustment during Training. With EMA synchronization, the threshold can be adjusted smoothly and in real time.
  • Figure 5: Analysis of the Relationship between DND’s Selection Preference and the Magnitude of Hidden State Changes across Transformer Layers. DND exhibits a stronger preference for selecting tokens whose representations undergo larger changes after passing through a given layer.
  • ...and 5 more figures
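
The threshold control mentioned in the captions for Figures 3 and 4 can be pictured with a generic controller. Eq. (9) and Eq. (10) are not reproduced in this section, so the sketch below is only an assumption about their general shape: a proportional update driven by the gap between the observed and target selection ratios (averaged over a small buffer of recent steps), followed by exponential-moving-average smoothing of the threshold actually used. The class name `ThresholdController` and the parameters `gain`, `ema_decay`, and `buffer_size` are hypothetical.

```python
from collections import deque
from typing import Deque, Sequence


class ThresholdController:
    """Generic proportional controller with EMA smoothing (illustrative only)."""

    def __init__(
        self,
        target_ratio: float,
        threshold: float = 0.5,
        gain: float = 0.1,
        ema_decay: float = 0.9,
        buffer_size: int = 4,
    ) -> None:
        self.target_ratio = target_ratio
        self.raw = threshold          # unsmoothed threshold
        self.threshold = threshold    # smoothed threshold used for routing
        self.gain = gain
        self.ema_decay = ema_decay
        self.buffer: Deque[float] = deque(maxlen=buffer_size)

    def update(self, scores: Sequence[float]) -> float:
        # Fraction of tokens the current threshold would select.
        selected = sum(1 for s in scores if s > self.threshold)
        self.buffer.append(selected / len(scores))
        # Error ratio averaged over the buffer of recent steps.
        error = sum(self.buffer) / len(self.buffer) - self.target_ratio
        # Proportional step: too many selected tokens raises the threshold.
        self.raw += self.gain * error
        # EMA keeps the deployed threshold from jumping between steps.
        self.threshold = (self.ema_decay * self.threshold
                          + (1.0 - self.ema_decay) * self.raw)
        return self.threshold


if __name__ == "__main__":
    ctrl = ThresholdController(target_ratio=0.2)
    for _ in range(3):
        print(ctrl.update([0.9, 0.8, 0.7, 0.6, 0.1]))
```

Averaging the error over a buffer damps the reaction to a single unusual batch, and the EMA makes the threshold drift gradually toward the controller's raw value rather than follow it step by step.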