Loop Neural Networks for Parameter Sharing
Kei-Sing Ng, Qingchen Wang
TL;DR
The paper introduces Loop Neural Networks, which refine predictions by iteratively looping over a subset of transformer blocks with residual connections, enabling longer computation without increasing the parameter count. It formalizes an update rule with predictive residuals and demonstrates two looping strategies that raise effective depth. Empirical results on GPT-2 variants show that loop models can match or exceed larger baselines (e.g., 81M-LOOP vs. the 124M baseline) and improve even small models (45M-LOOP) without extra data, at the cost of a modest training-time overhead. The approach offers a scalable, resource-efficient path to better language modeling performance on devices with limited capacity.
Abstract
The success of large-scale language models like GPT can be attributed to their ability to efficiently predict the next token in a sequence. However, these models rely on constant computational effort regardless of the complexity of the token they are predicting, lacking the capacity for iterative refinement. In this paper, we introduce a novel Loop Neural Network, which achieves better performance by utilizing longer computational time without increasing the model size. Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections. We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our loop models, showing improved performance in language modeling tasks while maintaining similar parameter counts. Importantly, these improvements are achieved without the need for extra training data.
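The looping idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: each "block" here is a small tanh layer standing in for a transformer block, and the names `make_block`, `apply_block`, `loop_forward`, `count_parameters`, and `effective_depth` are hypothetical. It only shows that revisiting the same blocks with residual additions increases effective depth while the parameter count stays fixed.

```python
import math
import random


def make_block(dim, seed):
    """Create one toy block: a dim x dim weight matrix (assumed stand-in for a transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Residual branch f(x) = tanh(W x) for a single block."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def loop_forward(blocks, x, n_loops):
    """Pass x through the same shared blocks n_loops times, adding each output residually."""
    for _ in range(n_loops):
        for weights in blocks:
            fx = apply_block(weights, x)
            x = [xi + fi for xi, fi in zip(x, fx)]  # x <- x + f(x)
    return x


def count_parameters(blocks):
    """Number of weights; independent of how many times the blocks are looped."""
    return sum(len(row) for weights in blocks for row in weights)


def effective_depth(blocks, n_loops):
    """Number of block applications in one forward pass."""
    return len(blocks) * n_loops


# Two shared blocks over a 4-dimensional state, looped three times.
blocks = [make_block(4, seed=s) for s in range(2)]
x = [1.0, 0.0, -1.0, 0.5]
refined = loop_forward(blocks, x, n_loops=3)
print(count_parameters(blocks), effective_depth(blocks, 3))
```

In this sketch, raising `n_loops` lengthens the computation (more block applications) while `count_parameters` is unchanged, which is the trade-off the abstract describes: more compute time in exchange for a fixed model size.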
