Loop Neural Networks for Parameter Sharing

Kei-Sing Ng; Qingchen Wang

TL;DR

The paper introduces Loop Neural Networks that refine predictions by iteratively looping over a subset of transformer blocks with residual connections, enabling longer computation without increasing parameters. It formalizes an update rule with predictive residuals and demonstrates two looping strategies to raise effective depth. Empirical results on GPT-2 variants show loop models can match or exceed larger baselines (e.g., 81M-LOOP vs 124M) and improve even small models (45M-LOOP) without extra data, albeit with modest training-time overhead. This approach offers a scalable, resource-efficient path to enhancing language modeling performance on devices with limited capacity.

Abstract

The success of large-scale language models like GPT can be attributed to their ability to efficiently predict the next token in a sequence. However, these models rely on constant computational effort regardless of the complexity of the token they are predicting, lacking the capacity for iterative refinement. In this paper, we introduce a novel Loop Neural Network, which achieves better performance by utilizing longer computational time without increasing the model size. Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections. We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our loop models, showing improved performance in language modeling tasks while maintaining similar parameter counts. Importantly, these improvements are achieved without the need for extra training data.
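The looping idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the residual is added around the whole looped subset (an update of the form x <- x + f(x)), and it replaces real transformer blocks with toy elementwise layers. The names `make_block`, `loop_forward`, and `num_loops` are hypothetical and chosen for illustration only.

```python
import math
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(weight: float, bias: float) -> Block:
    """Toy stand-in for a transformer block (assumption: elementwise tanh layer)."""
    def block(x: Vector) -> Vector:
        return [math.tanh(weight * v + bias) for v in x]
    return block


def loop_forward(x: Vector, blocks: List[Block], num_loops: int) -> Vector:
    """Reuse the same blocks num_loops times, adding a residual after each pass."""
    for _ in range(num_loops):
        h = x
        for block in blocks:  # one pass over the shared subset of blocks
            h = block(h)
        # Assumed residual update: x <- x + f(x)
        x = [xi + hi for xi, hi in zip(x, h)]
    return x


blocks = [make_block(0.5, 0.0), make_block(-0.3, 0.1)]
num_params = 2 * len(blocks)  # one weight and one bias per toy block
x0 = [0.2, -1.0, 0.7]

for n in (1, 3, 6):
    out = loop_forward(x0, blocks, n)
    effective_depth = n * len(blocks)  # blocks applied in total
    print(f"loops={n} params={num_params} effective_depth={effective_depth}")
```

In this sketch the parameter count stays fixed at `num_params` while the number of block applications grows with `num_loops`, which is the trade of extra compute for depth that the abstract describes.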

Paper Structure (20 sections, 3 equations, 3 figures, 2 tables)

Introduction
Background and Related Work
Parameter Sharing and Adaptive Computation
Universal Transformers
Adaptive Computation Time Models
Depth-Adaptive Transformers
Parameter Sharing Across Layers
Methodology
General Loop Structure with Residual Design
Looping Strategies
Illustration of the Loop Neural Network
Comparison with Existing Methods
Experiments
Experimental Setup
Evaluation Metrics
...and 5 more sections

Figures (3)

Figure 1: Computational Graph with Multiple Transformer Blocks
Figure 2: Training and Validation Loss Curves for First Experiment
Figure 3: Training and Validation Loss Curves for Second Experiment
