DND: Boosting Large Language Models with Dynamic Nested Depth
Tieyuan Chen, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Weiyao Lin, Jianguo Li
TL;DR
Dynamic Nested Depth (DND) introduces token-wise adaptive deepening by selecting critical tokens for an additional processing pass within mid-layers of pretrained LLMs. A token-choice router, threshold control, and a gated fusion mechanism enable efficient, post-training integration into dense and MoE architectures, achieving measurable accuracy gains with modest FLOP overhead. The method is stabilized by a router-controlling loss (score dispersion and distribution preservation) and a threshold control scheme (buffer proportional control with EMA synchronization), ensuring reliable token selection. Empirically, DND yields notable improvements across General Knowledge, Math/STEM, and Coding benchmarks on Qwen3-1.7B and Qwen3-30B-A3B, with minimal parameter increase and limited speed impact, highlighting its practical potential for enhancing LLM performance without full-scale retraining.
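The selection-and-fusion idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: `toy_layer`, the linear-plus-sigmoid router, and the scalar `gate` are all hypothetical stand-ins chosen only to show how routed tokens might receive a second pass that is then blended with the first-pass output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_layer(h):
    # Hypothetical stand-in for a transformer block acting on one token.
    return [0.5 * v + 1.0 for v in h]

def router_score(h, w):
    # Assumed router: a linear projection followed by a sigmoid, giving a score in (0, 1).
    return sigmoid(sum(a * b for a, b in zip(h, w)))

def nested_depth_pass(hidden, w, threshold, gate):
    """Run every token through the layer once; re-run only tokens whose
    router score exceeds the threshold, then fuse both outputs."""
    out, selected = [], []
    for i, h in enumerate(hidden):
        first = toy_layer(h)
        if router_score(first, w) > threshold:
            second = toy_layer(first)  # the extra "nested" pass
            # Assumed gated fusion: a convex blend with a scalar gate.
            fused = [(1 - gate) * a + gate * b for a, b in zip(first, second)]
            out.append(fused)
            selected.append(i)
        else:
            out.append(first)  # easy token: no additional compute
    return out, selected

tokens = [[4.0, 4.0], [-6.0, -6.0]]
out, selected = nested_depth_pass(tokens, w=[1.0, 1.0], threshold=0.5, gate=0.5)
print(selected, out)
```

In this toy setup only the first token crosses the threshold, so only it pays for the second pass; the fraction of such tokens is what determines the extra FLOPs.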
Abstract
We introduce Dynamic Nested Depth (DND), a novel method that improves the performance of off-the-shelf LLMs by selecting critical tokens to reprocess in a nested-depth manner. Specifically, at the end of a given transformer layer, DND identifies the more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router-controlling loss that makes token selection more distinguishable, and a threshold control scheme that keeps selection stable. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
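To make the notion of threshold control concrete, the sketch below shows one simple way a selection threshold could be steered toward a target selection ratio. The abstract does not specify the actual rule, so everything here is an assumption: the proportional gain `kp`, the EMA `decay`, the `target_ratio`, and the function names are hypothetical and serve only to illustrate the general feedback idea.

```python
def update_threshold(threshold, scores, target_ratio, kp=0.1):
    """One proportional step: raise the threshold when too many tokens
    pass it, lower it when too few do. Returns (new_threshold, ratio)."""
    ratio = sum(s > threshold for s in scores) / len(scores)
    return threshold + kp * (ratio - target_ratio), ratio

def ema(prev, new, decay=0.9):
    # Exponential moving average, used here to smooth the threshold over steps.
    return decay * prev + (1 - decay) * new

def run_control(scores, target_ratio, steps=200, init=0.5):
    """Repeatedly apply the proportional update and track a smoothed copy."""
    threshold, smooth, ratio = init, init, 0.0
    for _ in range(steps):
        threshold, ratio = update_threshold(threshold, scores, target_ratio)
        smooth = ema(smooth, threshold)
    return threshold, ratio, smooth

# Hypothetical router scores spread evenly over [0, 1).
scores = [i / 100 for i in range(100)]
threshold, ratio, smooth = run_control(scores, target_ratio=0.2)
print(round(threshold, 3), ratio, round(smooth, 3))
```

With evenly spread scores the threshold settles where roughly one token in five is selected, which is the kind of stable selection rate the scheme is meant to maintain.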
