You Do Not Fully Utilize Transformer's Representation Capacity
Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
TL;DR
The paper tackles representation collapse in decoder-only Transformers by showing that relying solely on the previous layer's hidden state limits long-range and multi-step reasoning. It proposes Layer-Integrated Memory (LIMe), a lightweight extension that routes and blends representations from all earlier layers using a trainable per-head, per-layer router over pre-allocated Key–Value buffers, incurring minimal overhead. Empirically, LIMe delivers faster convergence and lower perplexity per FLOP, improves accuracy on synthetic reasoning benchmarks, and enables very deep architectures to scale more effectively, while preserving higher value-vector entropy and better token separability. Analyses of the learned routing weights reveal systematic reuse of local and long-distance features, demonstrating LIMe's capacity to mitigate collapse without increasing hidden-state size and suggesting new directions for latent-space reasoning in deep Transformers.
Abstract
In contrast to RNNs, which compress their history into a single hidden state, Transformers can attend to all past tokens directly. However, standard Transformers rely solely on the hidden state from the previous layer to represent the entire context. We show that this design choice induces representation collapse and degrades performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a lightweight extension that leverages existing key-value buffers and learns per-head, per-layer routing weights to integrate representations from all previous layers with negligible overhead. Through extensive experiments (language modeling, synthetic reasoning benchmarks, and very deep architectures), LIMe consistently achieves faster convergence, lower perplexity per FLOP, and substantial accuracy improvements on synthetic tasks while preserving higher value-vector entropy and improved token separability. Finally, our analysis of the learned routing weights reveals systematic reuse of both local and long-distance features, demonstrating how LIMe mitigates collapse, unlocks richer representations without increasing hidden-state size, and points to promising directions for future research.
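The routing idea described above can be illustrated with a minimal sketch. It assumes the router reduces, for a single attention head, to a weighted sum over the key (or value) buffers cached by earlier layers. The function name `route_layers`, the nested-list buffer layout, and the use of unnormalized scalar weights are illustrative assumptions, not the paper's implementation; this section does not specify the exact parameterization.

```python
def route_layers(buffers, weights):
    """Blend cached per-layer buffers for one attention head.

    buffers: list of matrices, one per earlier layer; each matrix is a list
             of token rows (lists of floats). Hypothetical stand-in for the
             cached keys or values of that head.
    weights: one scalar routing weight per buffered layer (assumed learned).
    Returns a matrix with the same shape as a single buffer.
    """
    if len(buffers) != len(weights):
        raise ValueError("one routing weight per buffered layer is required")
    n_tokens = len(buffers[0])
    dim = len(buffers[0][0])
    mixed = [[0.0] * dim for _ in range(n_tokens)]
    # Accumulate the weighted contribution of every earlier layer.
    for w, layer in zip(weights, buffers):
        for t in range(n_tokens):
            for d in range(dim):
                mixed[t][d] += w * layer[t][d]
    return mixed


# Toy buffers: 3 layers, 2 tokens, head dimension 2 (made-up numbers).
buffers = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
last_only = route_layers(buffers, [0.0, 0.0, 1.0])
blended = route_layers(buffers, [0.5, 0.5, 0.0])
```

Under these assumptions, weights that select only the most recent layer reproduce the standard behaviour of reading a single layer's representation, while any other weighting lets the head reuse features from earlier layers without enlarging the hidden state.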
