Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers
Mohan Tang, Sidi Lu
TL;DR
TurboConn introduces dense downward connections that route each token's higher-layer hidden states to the lower layers of the next token, so that the effective reasoning depth scales with sequence length as $kL$. The method preserves the standard autoregressive loss and incurs only modest training overhead, using grouping to manage latency. Empirically, TurboConn yields consistent accuracy gains across Llama and Qwen models on Parity, GSM8K, and multi-step arithmetic, including $100\%$ accuracy on Parity in some setups, and improves length generalization and discriminative filtering. The results demonstrate that increasing latent depth via information flow, rather than solely increasing compute, is a viable path to enhanced reasoning in LLMs, and that it can be deployed without large increases in generation-time latency.
Abstract
Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps in which the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token $t$ to the lower layers of token $t+1$. Fine-tuning pre-trained LLMs with our method yields accuracy gains of 0.9% to over 10% on benchmarks such as GSM8K, Parity, and multi-step arithmetic. It also shows that the density of these connections is critical: our dense interaction significantly outperforms "sparse" alternatives that pass only a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, without retraining the full model from scratch or resorting to sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, and they offer a new mechanism to enhance LLMs without significantly affecting generation latency.
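The fixed-depth argument can be illustrated with a small path-counting sketch. It is not the authors' implementation: the function name `max_latent_depth`, the `(src_layer, dst_layer)` link format, and the convention that a link adds a state without itself counting as a step are all assumptions made for illustration. The sketch treats each hidden state as a node in a grid of tokens by layers and computes the longest chain of layer applications that can reach the top layer of the last token.

```python
def max_latent_depth(num_tokens, num_layers, turbo_links=()):
    """Longest chain of layer applications ending at the top layer of the last token.

    Node (t, l) is the hidden state of token t after layer l (l = 0 is the embedding).
    Standard causal attention lets layer l of token t read layer l - 1 of tokens <= t.
    Each hypothetical turbo link (src_layer, dst_layer), with src_layer > dst_layer,
    adds the state (t - 1, src_layer) into the state (t, dst_layer).
    """
    depth = [[0] * (num_layers + 1) for _ in range(num_tokens)]
    for t in range(num_tokens):
        for l in range(num_layers + 1):
            best = 0
            if l >= 1:
                # One more layer application on top of the deepest visible input.
                best = 1 + max(depth[s][l - 1] for s in range(t + 1))
            if t >= 1:
                # A downward link carries over the depth already reached
                # by the previous token at the source layer.
                for src, dst in turbo_links:
                    if dst == l:
                        best = max(best, depth[t - 1][src])
            depth[t][l] = best
    return depth[-1][num_layers]


standard = max_latent_depth(num_tokens=8, num_layers=4)
with_link = max_latent_depth(num_tokens=8, num_layers=4, turbo_links=[(4, 0)])
print(standard, with_link)
```

In this toy model, the standard Transformer's longest path stays at the number of layers regardless of sequence length, while a single top-to-bottom link per token makes it grow in proportion to the number of tokens. How this count maps onto the $kL$ scaling stated in the TL;DR, and how grouping changes it, is not specified in this section.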
