Platonic Transformers: A Solid Choice For Equivariance

Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. Sánchez, Sharvaree Vadgama, Georg Bökman, Erik J. Bekkers

TL;DR

The paper addresses the lack of geometric inductive biases in Transformers by introducing the Platonic Transformer, which attains equivariance to continuous translations and discrete roto-reflections using reference frames from Platonic symmetry groups without modifying the underlying architecture or computation. It recasts RoPE-based attention as a dynamic group convolution and provides a formal, end-to-end equivariance guarantee through frame-wise weight-sharing and group convolutions, with linear-time complexity in the convolutional view. Empirically, the method achieves competitive or state-of-the-art results on diverse domains, including CIFAR-10, ScanObjectNN, QM9, and OMol25, while preserving Transformer efficiency and enabling scalable inference. By combining principled geometric biases with scalable design, the Platonic Transformer offers a practical pathway to symmetry-aware learning in scientific applications, supported by reproducibility efforts and open-source code.

Abstract

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

Paper Structure

This paper contains 59 sections, 5 theorems, 40 equations, 4 figures, 8 tables.

Key Result

Proposition 1

Our proposed Transformer architecture is an equivariant model. A global roto-reflection $R \in \mathcal{G}$ applied to the input point cloud results in a corresponding transformation $L_R$ of the final output feature maps.
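The statement can be checked numerically on a toy case. The sketch below is not the authors' implementation: it assumes 2D points, substitutes the four-element rotation group $C_4$ for a Platonic group, uses a single attention layer with complex-valued RoPE phases, and all names (`rot2d`, `rope`, `frame_attention`) are hypothetical. It runs the same attention weights once per reference frame and checks that rotating the input by a group element only permutes the frame axis of the output, while translating the input leaves the output unchanged.

```python
import numpy as np

def rot2d(k):
    """Rotation by k * 90 degrees, i.e. an element of the cyclic group C4."""
    c, s = np.cos(k * np.pi / 2), np.sin(k * np.pi / 2)
    return np.array([[c, -s], [s, c]])

# Toy stand-in (assumption) for the Platonic reference frames: the rotations of C4.
FRAMES = [rot2d(k) for k in range(4)]

def rope(feats, pos, freqs):
    """Multiply complex channels by unit phases exp(i * freqs @ pos)."""
    return feats * np.exp(1j * pos @ freqs.T)

def frame_attention(x, f, Wq, Wk, Wv, freqs):
    """RoPE attention run once per frame with the same (shared) weights.

    x: (N, 2) positions, f: (N, G, D) features indexed by frame.
    Returns real outputs of shape (N, G, Dv).
    """
    out = np.empty((f.shape[0], f.shape[1], Wv.shape[1]))
    for g, R in enumerate(FRAMES):
        local = x @ R  # row n is R^T x_n: positions expressed in frame g
        q = rope(f[:, g] @ Wq, local, freqs)
        k = rope(f[:, g] @ Wk, local, freqs)
        scores = np.real(q.conj() @ k.T)  # phases cancel up to x_m - x_n
        scores -= scores.max(axis=1, keepdims=True)  # stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        out[:, g] = attn @ (f[:, g] @ Wv)
    return out

rng = np.random.default_rng(0)
N, D, C, Dv = 6, 3, 4, 5
x = rng.normal(size=(N, 2))
scalars = rng.normal(size=(N, D))
# Lift scalar inputs to functions on the group by copying them to every frame.
f = np.repeat(scalars[:, None, :], len(FRAMES), axis=1)
Wq = rng.normal(size=(D, C)) + 1j * rng.normal(size=(D, C))
Wk = rng.normal(size=(D, C)) + 1j * rng.normal(size=(D, C))
Wv = rng.normal(size=(D, Dv))
freqs = rng.normal(size=(C, 2))

out = frame_attention(x, f, Wq, Wk, Wv, freqs)
# Rotate every point by 90 degrees (rows become R x_n).
out_rot = frame_attention(x @ rot2d(1).T, f, Wq, Wk, Wv, freqs)
# Translate every point by the same offset.
out_shift = frame_attention(x + np.array([3.0, -1.5]), f, Wq, Wk, Wv, freqs)

rotation_ok = np.allclose(out_rot, np.roll(out, 1, axis=1))  # frame axis is permuted
translation_ok = np.allclose(out_shift, out)                 # output is unchanged
print(rotation_ok, translation_ok)
```

In this toy setting the transformation $L_R$ is a cyclic shift of the frame axis. For a Platonic group it would be the corresponding permutation of group elements.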

Figures (4)

  • Figure 1: Visualization of Weight-Shared RoPE within the $N$-layer Platonic Transformer. Scalar and vector inputs are lifted to become functions on the Platonic solid symmetry group of choice (here, the Tetrahedral group). The same multi-head self-attention mechanism is applied in parallel, with each instance rotating the features according to a different reference frame $R_i \in \mathcal{G}$. Choosing the trivial group as $\mathcal{G}$ reduces this framework to a standard Transformer.
  • Figure 1: CIFAR-10 Accuracy (%).
  • Figure 2: Elements of the symmetry groups of Platonic solids form a subgroup of $SO(3)$.
  • Figure 4: We visualize the weight matrices for linear layers that are equivariant under the tetrahedral rotation group, implemented in the Fourier domain. Each subfigure shows weights to the left and features to the right. Purple features transform according to $\rho_1$ (or technically $\rho_1\otimes I_C$ since there are $C$ copies of $\rho_1$), red features according to $\rho_2$ (by multiplication by $\rho_2(g)\otimes I_C$ from the left) and green features according to $\rho_3$ (by multiplication by $\rho_3(g)^\top$ from the right; if we flattened the green features, they would transform by $\rho_3(g)\otimes I_{3C}$ from the left). The weight matrix is parameterized by the $C\times C$ matrices $W_1, W_{21}, W_{22}$ and the $3C\times 3C$ matrix $W_3$, yielding a total of $12C^2$ learnable parameters. The total number of multiplications to compute the linear layer, implemented as a batched matrix multiplication in \ref{fig:fourier_batched}, is $4\cdot(3C)^2=36C^2$, yielding a $4\times$ FLOP reduction versus an ordinary layer from $12C$ to $12C$ dimensions ($144C^2$ multiplications).
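
The counts quoted in the Figure 4 caption can be verified with a short calculation, using only the matrix sizes stated there:

$$
\underbrace{3\cdot C^2}_{W_1,\,W_{21},\,W_{22}} + \underbrace{(3C)^2}_{W_3} = 3C^2 + 9C^2 = 12C^2,
\qquad
\frac{(12C)^2}{4\cdot(3C)^2} = \frac{144C^2}{36C^2} = 4 .
$$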

Theorems & Definitions (14)

  • Proposition 1: End-to-End Equivariance
  • Proposition 2: Linear RoPE Attention as Dynamic Convolution
  • Remark 1: Purely Geometric vs. Mixed Kernels
  • Corollary 1: Linear-Time Complexity
  • Definition 1: Representation
  • Definition 2: Unitary Representation
  • Definition 3: Irreducible Representation
  • Definition 4: Rotary Position Embedding (RoPE) Operator
  • Proposition 3: Translation Invariance of the RoPE Attention Score
  • proof
  • ...and 4 more
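
Proposition 2 and Corollary 1 concern linear RoPE attention. This listing does not give the paper's exact kernel or normalization, so the following is only a sketch of the standard reassociation argument behind linear-time attention, assuming unnormalized scores with no softmax and real-valued values. The function names are hypothetical. Because the RoPE phases are applied to queries and keys separately, the key-value summary can be accumulated in one pass over the $N$ points and then reused for every query.

```python
import numpy as np

def rope(feats, pos, freqs):
    """Multiply complex channels by unit phases exp(i * freqs @ pos)."""
    return feats * np.exp(1j * pos @ freqs.T)

def rope_attention_quadratic(q, k, v):
    """Builds the full (N, N) score matrix: O(N^2) time and memory."""
    scores = np.real(q.conj() @ k.T)
    return scores @ v

def rope_attention_linear(q, k, v):
    """Same result via a (C, Dv) key-value summary: O(N) in the number of points."""
    summary = k.T @ v                    # one pass over all keys and values
    return np.real(q.conj() @ summary)   # one pass over all queries

rng = np.random.default_rng(0)
N, C, Dv = 50, 8, 4
pos = rng.normal(size=(N, 3))            # 3D point positions
freqs = rng.normal(size=(C, 3))          # one frequency vector per channel
q = rope(rng.normal(size=(N, C)) + 1j * rng.normal(size=(N, C)), pos, freqs)
k = rope(rng.normal(size=(N, C)) + 1j * rng.normal(size=(N, C)), pos, freqs)
v = rng.normal(size=(N, Dv))

same_output = np.allclose(rope_attention_quadratic(q, k, v),
                          rope_attention_linear(q, k, v))
print(same_output)
```

The two functions agree because matrix multiplication is associative and the values are real, so taking the real part commutes with the product by `v`.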