Toward Manifest Relationality in Transformers via Symmetry Reduction

J. François, L. Ravera

TL;DR

This work identifies substantial internal redundancy in Transformer models arising from coordinate-dependent representations and head-space symmetries, and proposes a symmetry-reduction framework that operates on invariant relational quantities. By drawing on the Dressing Field Method from gauge theory, it develops a relational reformulation of token representations and attention (e.g., using Gram invariants $G=XX^{\top}$ and invariant composites like $G_{QK}=W_Q^{\top}W_K$ and $G_{VO}=W_OW_V$) and outlines optimization on reduced spaces to avoid symmetry-induced degeneracies. It further details practical schemes for symmetry-free optimization, including dressing-based representatives, invariant parameterizations, and projected gradient flow, while accounting for residual ambiguities from degeneracies and architecture-dependent symmetry breaking (e.g., LayerNorm). The framework aims to reduce parameter redundancy, clarify optimization dynamics, and yield more interpretable relational structures, with empirical validation identified as an important future step. The work provides a principled path toward Transformer architectures whose representations, attention, and learning dynamics are expressed in manifestly relational terms, potentially improving efficiency and interpretability in large-scale models.

Abstract

Transformer models contain substantial internal redundancy arising from coordinate-dependent representations and continuous symmetries, in model space and in head space, respectively. While recent approaches address this by explicitly breaking symmetry, we propose a complementary framework based on symmetry reduction. We reformulate representations, attention mechanisms, and optimization dynamics in terms of invariant relational quantities, eliminating redundant degrees of freedom by construction. This perspective yields architectures that operate directly on relational structures, providing a principled geometric framework for reducing parameter redundancy and analyzing optimization.
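As a rough illustration of what "invariant relational quantities" means here, the sketch below numerically checks that the Gram matrix $G=XX^{\top}$ and the composites $G_{QK}=W_Q^{\top}W_K$ and $G_{VO}=W_OW_V$ are unchanged under a change of coordinates, even though the raw embeddings and weights are not. The toy sizes, the specific group actions (an orthogonal rotation of model space, an invertible change of basis in head space), and the shape convention ($W_Q$, $W_K$, $W_V$ map model space to head space and act on column vectors) are assumptions chosen for the example and are not taken from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def close(a, b, tol=1e-9):
    """Entrywise comparison of two equally shaped matrices."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy sizes (assumed): 3 tokens, d_model = 2, d_head = 2.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # rows are token embeddings
W_Q = [[1.0, 0.0], [2.0, 1.0]]              # d_head x d_model
W_K = [[0.5, -1.0], [1.0, 3.0]]             # d_head x d_model
W_V = [[2.0, 1.0], [0.0, 1.0]]              # d_head x d_model
W_O = [[1.0, -1.0], [0.5, 2.0]]             # d_model x d_head

# Model-space change of coordinates (assumed orthogonal): X -> X R.
t = 0.7
R = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
X_rot = matmul(X, R)
gram = matmul(X, transpose(X))
gram_rot = matmul(X_rot, transpose(X_rot))

# Head-space change of basis (assumed invertible A):
# W_Q -> A W_Q, W_K -> A^{-T} W_K, W_V -> A W_V, W_O -> W_O A^{-1}.
A = [[2.0, 1.0], [1.0, 1.0]]
A_inv = inv2(A)
W_Q2 = matmul(A, W_Q)
W_K2 = matmul(transpose(A_inv), W_K)
W_V2 = matmul(A, W_V)
W_O2 = matmul(W_O, A_inv)

g_qk, g_qk2 = matmul(transpose(W_Q), W_K), matmul(transpose(W_Q2), W_K2)
g_vo, g_vo2 = matmul(W_O, W_V), matmul(W_O2, W_V2)

gram_invariant = close(gram, gram_rot)       # G = X X^T is unchanged
qk_invariant = close(g_qk, g_qk2)            # G_QK = W_Q^T W_K is unchanged
vo_invariant = close(g_vo, g_vo2)            # G_VO = W_O W_V is unchanged
raw_weights_changed = not close(W_Q, W_Q2)   # the individual factors do change
```

The point of the check is that many distinct weight settings share the same $G_{QK}$ and $G_{VO}$, so working with the composites directly removes that redundancy. In a full architecture, components such as LayerNorm can break part of the model-space symmetry, so the orthogonal rotation used above should be read as an idealized case.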
