Beyond Parameters: Exploring Virtual Logic Depth for Scaling Laws

Ruike Zhu, Hanwen Zhang, Kevin Li, Tianyu Shi, Yiqun Duan, Chi Wang, Tianyi Zhou, Arindam Banerjee, Zengyi Qin

TL;DR

This work introduces Virtual Logical Depth (VLD), a fourth scaling dimension that increases effective algorithmic depth by reusing transformer layers with shared parameters, so the parameter count does not grow. The authors develop entropy-based and task-based measures to quantify knowledge capacity and reasoning capability separately, using a high-entropy random dataset for memory and the iGSM synthetic benchmark plus real-world benchmarks for reasoning. Across controlled pretraining and post-training experiments, VLD keeps knowledge capacity nearly constant while delivering substantial improvements in reasoning; cycle-pattern reuse often provides the strongest gains, and smaller VLD-augmented models sometimes outperform larger baselines. The findings suggest a parameter-efficient scaling path that decouples reasoning from model size, and they invite further study of how parameter reuse interacts with traditional scaling strategies.

Abstract

Scaling large language models typically involves three dimensions: depth, width, and parameter count. In this work, we explore a fourth dimension, virtual logical depth (VLD), which increases effective algorithmic depth without changing parameter count by reusing weights. While parameter reuse is not new, its role in scaling has been underexplored. Unlike recent test-time methods that scale token-wise, VLD alters the internal computation graph during training and inference. Through controlled experiments, we obtain three key insights. (1) Knowledge capacity vs. parameters: at fixed parameter count, VLD leaves knowledge capacity nearly unchanged, while across models capacity still scales with parameters. (2) Reasoning vs. reuse: properly implemented VLD substantially improves reasoning ability without more parameters, decoupling reasoning from size. This suggests a new scaling path beyond token-wise test-time methods. (3) Robustness and generality: reasoning gains persist across architectures and reuse schedules, showing VLD captures a general scaling behavior. These results provide insight into future scaling strategies and raise a deeper question: does superintelligence require ever-larger models, or can it be achieved by reusing parameters and increasing logical depth? We argue many unknown dynamics in scaling remain to be explored. Code is available at https://anonymous.4open.science/r/virtual_logical_depth-8024/.

Paper Structure

This paper contains 56 sections, 7 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of classical model-size scaling and VLD scaling. Bubble size is proportional to total effective depth (base layer depth + virtual logical depth). Blue bubbles represent standard models that follow classical model-size scaling without VLD; green bubbles show models with VLD scaling applied, which trace near-vertical scaling paths. VLD scaling significantly enhances reasoning capability (y-axis) while keeping knowledge capacity (x-axis) almost constant.
  • Figure 2: Different patterns of parameter reuse that increase VLD while keeping the total number of parameters constant. (a) A standard transformer with 3 layers. (b) Neighboring layers repeated sequentially. (c) Layers repeated in cycle order. (d) Layers repeated in inverse-cycle order. Within a pattern, two layers with the same color share the same parameters. In all cases, the actual number of parameters does not change.
  • Figure 3: Reasoning Capability Under VLD Patterns. (a) Standard scaling with native depth (16-layer $\approx$ 200M; 12-layer $\approx$ 150M). (b--d) Sequence, Cycle, and Inverse Cycle applied to 4-layer (50M; red) and 8-layer (100M; blue) backbones. Models are trained from scratch and tested on op15 GSM data.
  • Figure 4: Knowledge Capacity Under VLD. (a) Non-VLD: capacity increases with parameter count. (b) At fixed parameters, absorbed information stays nearly constant across VLD depths/patterns for 5M and 20M models.
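
The reuse patterns named in Figure 2 can be illustrated as layer-index schedules. The sketch below is not the authors' implementation: the function names `reuse_schedule` and `apply_layers` are hypothetical, the toy "layers" are plain Python functions standing in for transformer blocks, and the inverse-cycle ordering (a forward pass followed by a reversed pass) is an assumption inferred from the caption wording.

```python
from typing import Callable, List


def reuse_schedule(num_layers: int, repeats: int, pattern: str) -> List[int]:
    """Return the order in which physical layers are applied.

    The schedule has num_layers * repeats entries but only ever refers to
    num_layers distinct layers, so the parameter count stays fixed.
    """
    base = list(range(num_layers))
    if pattern == "sequence":
        # Repeat each layer back to back: 0,0,1,1,2,2
        return [i for i in base for _ in range(repeats)]
    if pattern == "cycle":
        # Repeat the whole stack in order: 0,1,2,0,1,2
        return base * repeats
    if pattern == "inverse_cycle":
        # Assumption: alternate forward and reversed passes: 0,1,2,2,1,0
        order: List[int] = []
        for r in range(repeats):
            order.extend(base if r % 2 == 0 else base[::-1])
        return order
    raise ValueError(f"unknown pattern: {pattern}")


def apply_layers(x, layers: List[Callable], schedule: List[int]):
    """Run the input through shared layers in the scheduled order."""
    for idx in schedule:
        x = layers[idx](x)
    return x


if __name__ == "__main__":
    # Toy stand-ins for three transformer blocks (hypothetical).
    layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    for name in ("sequence", "cycle", "inverse_cycle"):
        schedule = reuse_schedule(len(layers), 2, name)
        print(name, schedule, apply_layers(5, layers, schedule))
```

In each pattern the effective depth doubles from 3 to 6 applied layers while the set of distinct layers, and therefore the parameter count, is unchanged.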