Conformal Transformations for Symmetric Power Transformers

Saurabh Kumar, Jacob Buckman, Carles Gelada, Sean Zhang

TL;DR

This work tackles the context-length degradation of symmetric power transformers by introducing conformal-sympow, which adds data-dependent gating and data-dependent rotary embeddings to manage the finite recurrent state more effectively. By formulating a conformal transformation on the recurrent state, the method erases and reorganizes information as context scales, while adaptive rotations store information more efficiently. Empirical results on LongCrawl64 show that conformal-sympow maintains performance as training and evaluation context lengths grow, outperforming plain sympow and narrowing the gap with softmax baselines; learned rotary embeddings further boost both training and generalization. The approach offers a practical, low-overhead path to robust long-context processing with linear-time inference characteristics inherent to sympow architectures.

Abstract

Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, closes part of this performance gap by leveraging symmetric tensor embeddings, achieving performance comparable to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.
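To make the role of the recurrent state concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a degree-2 power attention with normalized weights, a scalar gate per token, and a single rotary pair whose angle is a running sum of per-token increments. The names `phi`, `attention_form`, `recurrent_form`, `gammas`, and `thetas` are hypothetical, and the gates and angles are drawn at random here rather than computed from the input as the method describes. The full tensor power is used in place of the compressed symmetric embedding for simplicity. Under these assumptions, the quadratic-time attention form and the fixed-size recurrent form produce the same outputs.

```python
import math
import random

def rotate(x, theta):
    """Rotate a 2-d vector by angle theta (a single rotary pair)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    """Degree-2 tensor power embedding: dot(phi(a), phi(b)) == dot(a, b) ** 2."""
    return [xi * xj for xi in x for xj in x]

def attention_form(Q, K, V, gammas, thetas):
    """Quadratic-time form: explicit gated, rotated power-2 scores."""
    outputs = []
    for i in range(len(Q)):
        qi = rotate(Q[i], thetas[i])
        num = [0.0] * len(V[0])
        den = 0.0
        for j in range(i + 1):
            kj = rotate(K[j], thetas[j])
            # Product of the gates applied after token j was written.
            decay = math.prod(gammas[j + 1 : i + 1])
            w = decay * dot(qi, kj) ** 2
            num = [n + w * v for n, v in zip(num, V[j])]
            den += w
        outputs.append([n / den for n in num])
    return outputs

def recurrent_form(Q, K, V, gammas, thetas):
    """Linear-time form: a fixed-size state S (and normalizer Z) updated per token."""
    d_phi, d_v = len(phi(K[0])), len(V[0])
    S = [[0.0] * d_v for _ in range(d_phi)]
    Z = [0.0] * d_phi
    outputs = []
    for q, k, v, g, th in zip(Q, K, V, gammas, thetas):
        fk = phi(rotate(k, th))
        # The gate shrinks the old state (freeing capacity), then the new token is written.
        S = [[g * s + f * vv for s, vv in zip(row, v)] for row, f in zip(S, fk)]
        Z = [g * z + f for z, f in zip(Z, fk)]
        fq = phi(rotate(q, th))
        num = [sum(fq[a] * S[a][c] for a in range(d_phi)) for c in range(d_v)]
        outputs.append([n / dot(fq, Z) for n in num])
    return outputs

# Illustrative random inputs; a real model would compute gates and angles from the data.
random.seed(0)
T = 6
Q = [[random.gauss(0, 1) for _ in range(2)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(2)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(2)] for _ in range(T)]
gammas = [random.uniform(0.5, 1.0) for _ in range(T)]
increments = [random.uniform(0.0, 0.5) for _ in range(T)]
thetas = [sum(increments[: t + 1]) for t in range(T)]

attention_out = attention_form(Q, K, V, gammas, thetas)
recurrent_out = recurrent_form(Q, K, V, gammas, thetas)
max_gap = max(
    abs(x - y) for ra, rb in zip(attention_out, recurrent_out) for x, y in zip(ra, rb)
)
print("largest difference between the two forms:", max_gap)
```

In this sketch the state `S` has a fixed size regardless of context length, which is the capacity limit the abstract refers to. A gate below 1 discounts older entries, and the rotation changes where each key lands in the state before it is embedded.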

Paper Structure

This paper contains 20 sections, 7 theorems, 47 equations, 4 figures.

Key Result

Proposition 1

When using rotary embeddings with sympow transformers, the attention formulation of the output $Y_i$ at time step $i$ is equivalent to its recurrent formulation. Specifically,

Figures (4)

  • Figure 1: The training performance of sympow degrades relative to a softmax transformer baseline as the context size grows. Sympow with data-dependent gating (sympow+gating) closes this performance gap. Training performance further improves when adding data-dependent rotations with conformal-sympow. In contrast to sympow, conformal-sympow does not suffer from the degraded scaling of training context, either when (a) $p=4$ or (b) $p=2$.
  • Figure 2: Average loss at different evaluation context lengths ranging from $1$ to $65,536$ tokens. The training context size is $16,384$, indicated by the dashed red line. (a) Sympow is unable to generalize beyond the training context size of $16,384$. (b) Gated sympow generalizes well and conformal-sympow improves performance further.
  • Figure 3: Training curves for sympow, sympow+gating, and conformal-sympow with $p=4$ at different training context lengths. We can see that both sympow+gating and conformal-sympow improve optimization over sympow throughout training.
  • Figure 4: Training curves for sympow, sympow+gating, and conformal-sympow with $p=2$ at different training context lengths. We can see that both sympow+gating and conformal-sympow improve optimization over sympow throughout training.

Theorems & Definitions (11)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition
  • proof
  • Proposition 3
  • proof
  • Proposition 3
  • proof
  • Proposition 3
  • ...and 1 more