RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals
Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim
TL;DR
RingFormer addresses the high parameter cost of Transformers by reusing a single Transformer block in a ring-like recurrent fashion and introducing input-dependent, depth-aware level signals generated via low-rank transformations. The signals modulate the attention and feedforward projections ($Q_i= W_Q x + g_{A_{Qi}}(x)$, $K_i= W_K x + g_{A_{Ki}}(x)$, $V_i= W_V x + g_{A_{Vi}}(x)$, and $FFN(x)=\sigma((x+g_{Fi}(x))W_{up})W_{down}$) so that the shared block adapts across iterations while the added parameter count stays low ($M_i = A_i B_i^T$, with $r\ll d$). Across translation and image classification, RingFormer achieves competitive or superior performance relative to parameter-matched baselines, and analyses with CKA and MAD indicate that its behavior aligns more closely with the vanilla Transformer than that of previous recurrent designs. The work demonstrates that adaptive, low-rank level signals can substantially improve parameter efficiency without sacrificing core Transformer capabilities, potentially democratizing access to high-performance models. Limitations include the added compute from the signals and a lack of large-scale language-model experiments, pointing to future work on scaling and broader-domain validation.
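The projection formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, a row-vector convention ($xW$ rather than $Wx$), ReLU for $\sigma$, plain residual connections with no layer normalization, and a level signal of the form $g_i(x) = x A_i B_i^T$. The names `level_signal`, `ring_block`, and `ringformer`, as well as all sizes, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, d_ff, n_levels, seq_len = 16, 2, 32, 3, 5  # illustrative sizes, r << d

# Shared weights, reused at every iteration of the ring.
W_Q, W_K, W_V = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
W_up = rng.normal(0, d ** -0.5, (d, d_ff))
W_down = rng.normal(0, d_ff ** -0.5, (d_ff, d))

def low_rank():
    # Factors A_i, B_i of shape (d, r), so that M_i = A_i B_i^T has rank <= r.
    return rng.normal(0, d ** -0.5, (d, r)), rng.normal(0, d ** -0.5, (d, r))

# One set of low-rank factors per level for the Q, K, V and FFN signals.
levels = [{name: low_rank() for name in ("Q", "K", "V", "F")}
          for _ in range(n_levels)]

def level_signal(x, factors):
    # Assumed form g_i(x) = x A_i B_i^T, computed without forming the d x d matrix.
    A, B = factors
    return (x @ A) @ B.T

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ring_block(x, lvl):
    # Shared projections plus level-specific, input-dependent signals.
    q = x @ W_Q + level_signal(x, lvl["Q"])
    k = x @ W_K + level_signal(x, lvl["K"])
    v = x @ W_V + level_signal(x, lvl["V"])
    x = x + softmax(q @ k.T / np.sqrt(d)) @ v
    # FFN(x) = sigma((x + g_F(x)) W_up) W_down, with ReLU standing in for sigma.
    h = np.maximum((x + level_signal(x, lvl["F"])) @ W_up, 0.0)
    return x + h @ W_down

def ringformer(x):
    # The same block is applied once per level; only the signals change.
    for lvl in levels:
        x = ring_block(x, lvl)
    return x

x = rng.normal(size=(seq_len, d))
y = ringformer(x)

shared_params = 3 * d * d + 2 * d * d_ff  # weights of the single shared block
per_level_params = 4 * 2 * d * r          # four signals, two (d, r) factors each
```

Under these assumptions, each extra level adds `per_level_params` parameters instead of another full copy of `shared_params`, which is the source of the savings when $r \ll d$.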
Abstract
Transformers have achieved great success in processing sequential data such as text. Their architecture, consisting of several attention and feedforward blocks, can model relations between elements of a sequence in parallel, which makes them efficient to train and effective in sequence modeling. Despite this strong performance, their parameter count is considerably larger than that of other architectures such as RNN- and CNN-based models. Therefore, several approaches have explored parameter sharing and recurrence in Transformer models to address their computational demands. However, such methods struggle to maintain high performance compared to the original Transformer model. To address this challenge, we propose RingFormer, which employs one Transformer layer that processes its input repeatedly in a circular, ring-like manner, while utilizing low-rank matrices to generate input-dependent level signals. This allows us to reduce the number of model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification, as validated in the experiments.
