Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

Mohan Tang; Sidi Lu

TL;DR

TurboConn introduces dense downward connections that route higher-layer information from each token to the lower layers of the next token, so that the effective reasoning depth scales with sequence length as $kL$. The method preserves the standard autoregressive loss and incurs only modest training overhead, using grouping to manage latency. Empirically, TurboConn yields consistent accuracy gains across Llama and Qwen models on Parity, GSM8K, and multi-step arithmetic, reaching $100\%$ accuracy on Parity in some setups and improving length generalization and discriminative filtering. The results indicate that increasing latent depth via information flow, rather than solely increasing compute, is a viable path to stronger reasoning in LLMs, and that it can be deployed without large increases in generation-time latency.

Abstract

Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token $t$ to the lower layers of token $t+1$. Fine-tuning pre-trained LLMs with our method yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic. It also shows that the density of these backward connections is critical: our dense interaction significantly outperforms "sparse" alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without retraining the full model from scratch or relying on sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, and they offer a new mechanism to enhance LLMs without significantly affecting generation latency.
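To make the fixed-depth argument concrete, the following minimal sketch counts the longest chain of layer applications in a toy dependency graph, with and without a downward connection. It is not the authors' implementation: the function name `longest_path_depth`, the single top-layer-to-first-layer edge, and the reading of depth as "tokens times layers" are all assumptions made for illustration (this excerpt does not define $k$ and $L$, and the paper describes multiple, denser connections).

```python
def longest_path_depth(num_tokens, num_layers, downward=False):
    """Length of the longest dependency chain ending at the last token's top layer.

    depth[t][layer] counts layer applications on the longest path into the
    hidden state of token t after `layer` (layer 0 is the embedding).
    Standard causal edges: (s, layer - 1) -> (t, layer) for every s <= t.
    Assumed downward edge (illustrative only): (t - 1, top layer) -> (t, layer 1).
    """
    depth = [[0] * (num_layers + 1) for _ in range(num_tokens)]
    for t in range(num_tokens):
        for layer in range(1, num_layers + 1):
            # Causal attention plus the residual stream: same-or-earlier tokens, one layer down.
            preds = [depth[s][layer - 1] for s in range(t + 1)]
            if downward and layer == 1 and t > 0:
                # Hypothetical downward connection from the previous token's top layer.
                preds.append(depth[t - 1][num_layers])
            depth[t][layer] = 1 + max(preds)
    return depth[-1][num_layers]


if __name__ == "__main__":
    for tokens in (1, 5, 10):
        plain = longest_path_depth(tokens, num_layers=4)
        turbo = longest_path_depth(tokens, num_layers=4, downward=True)
        print(f"tokens={tokens}: standard depth={plain}, with downward edge={turbo}")
```

In this toy graph the standard model's longest path stays at the number of layers regardless of sequence length, while the downward edge lets it grow with the number of tokens, which is the qualitative behaviour the abstract attributes to TurboConn.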

Paper Structure (41 sections, 20 equations, 4 figures, 6 tables)

Introduction
Related Work
Chain-of-Thought
Looped/Universal Transformers
On Relationship between Depth and Reasoning
Method
Background and Notations
Our Approach
Analysis of Depth
Analysis of Computational Cost
Grouping
Experiments
Reasoning Tasks
Training Hyperparameters
Main Results
...and 26 more sections

Figures (4)

Figure 1: A conceptual illustration of the proposed Turbo Connection for Transformers.
Figure 2: Modified Transformer architecture with downward connections (orange arrows) from higher to lower decoder layers.
Figure 3: Grouping strategy for downward connections. Example shown for group of 2.
Figure 4: Length Generalization Performance. We evaluate a Llama 3.1 8B model, trained on Parity with up to 10-digit sequences, on its ability to generalize to longer input sequences.