Conformal Transformations for Symmetric Power Transformers
Saurabh Kumar, Jacob Buckman, Carles Gelada, Sean Zhang
TL;DR
This work addresses the performance degradation that symmetric power (sympow) transformers suffer as context length grows. It introduces conformal-sympow, which adds data-dependent gating and data-dependent rotary embeddings so that the finite recurrent state is used more effectively. The two mechanisms together are formulated as a conformal transformation of the recurrent state: gating erases stale information as the context scales, while adaptive rotations store new information more efficiently. Empirical results on LongCrawl64 show that conformal-sympow maintains performance as training and evaluation context lengths grow, outperforming plain sympow and narrowing the gap with softmax baselines; learned rotary embeddings further improve both training performance and generalization. The approach offers a practical, low-overhead path to robust long-context processing while keeping the linear-time inference inherent to sympow architectures.
Abstract
Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, closes part of this performance gap by leveraging symmetric tensor embeddings, achieving performance comparable to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when the training or evaluation context length is scaled up. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes these limitations of sympow transformers, achieving robust performance as training and evaluation context lengths scale.
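To make the idea of multiplicative gating on a finite recurrent state concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that are not in the text above: a degree-2 feature map built from the full (unsymmetrized) outer product, which gives the same inner product (q·k)^2 as a symmetric embedding but with redundant entries; a single scalar gate per time step obtained by passing a logit through a sigmoid; and a normalizer state decayed by the same gate. The names `sympow2`, `gated_recurrent_attention`, and `gate_logits` are hypothetical, the gate logits are passed in directly rather than computed from the input, and the rotary embeddings are omitted entirely.

```python
import math


def sympow2(x):
    """Degree-2 tensor-power features: the flattened outer product of x with itself.

    For any q and k, dot(sympow2(q), sympow2(k)) == dot(q, k) ** 2.
    (Illustrative stand-in for a symmetric power embedding.)
    """
    return [a * b for a in x for b in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_recurrent_attention(queries, keys, values, gate_logits):
    """Causal linear attention in recurrent form with a scalar gate per step.

    State update (assumed form): S_t = gamma_t * S_{t-1} + phi(k_t) v_t^T,
    where gamma_t = sigmoid(gate_logits[t]) lies in (0, 1).
    A gate near 0 erases the past; a gate near 1 keeps it.
    """
    d_feat = len(keys[0]) ** 2
    d_val = len(values[0])
    S = [[0.0] * d_val for _ in range(d_feat)]  # fixed-size recurrent state
    Z = [0.0] * d_feat                          # normalizer state
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gate_logits):
        gamma = sigmoid(g)
        phi_k = sympow2(k)
        phi_q = sympow2(q)
        # Decay the old state, then write the new key/value pair into it.
        for i in range(d_feat):
            Z[i] = gamma * Z[i] + phi_k[i]
            for j in range(d_val):
                S[i][j] = gamma * S[i][j] + phi_k[i] * v[j]
        # Read out with the current query; the small epsilon avoids division by zero.
        denom = sum(phi_q[i] * Z[i] for i in range(d_feat)) + 1e-9
        outputs.append(
            [sum(phi_q[i] * S[i][j] for i in range(d_feat)) / denom
             for j in range(d_val)]
        )
    return outputs
```

The state `S` has a fixed size regardless of sequence length, which is the source of both the linear-time inference and the capacity limit discussed in the abstract. In this sketch, a strongly negative gate logit at some step makes the state forget everything written before that step, which is one simple way to picture "freeing up capacity".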
