Does equivariance matter at scale?
Johann Brehmer, Sönke Behrends, Pim de Haan, Taco Cohen
TL;DR
The paper asks whether building known symmetries into the architecture, here via E(3)-equivariant networks, still pays off at scale compared with learning those symmetries from data. Using a rigid-body interaction benchmark and two architectures (a baseline Transformer and the equivariant GATr), it shows that performance scales with compute as a power law and that the equivariant model outperforms the non-equivariant one at every tested compute budget. Data augmentation can largely close the data-efficiency gap, given sufficient training epochs. The two models also differ in their compute-optimal allocation: the best split of a budget between model size and training steps depends on whether the architecture is symmetry-aware. The findings suggest that strong inductive biases help not only in low-data regimes but also with large data and compute, though practical gains depend on efficient implementations. Overall, the work underlines the continued relevance of symmetry principles in scalable neural modeling and motivates further efficiency-focused development of equivariant architectures.
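To make the data-augmentation alternative concrete, the following is a minimal sketch of how a non-equivariant model's training samples could be augmented with random rigid transformations. It is an illustration under stated assumptions, not the authors' pipeline: the function names `random_rotation` and `augment`, the `max_shift` parameter, and the toy point-cloud data are all hypothetical, and the sketch covers only rotations and translations, not the reflections that E(3) also contains.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3x3 rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    # Fix the sign ambiguity of QR so that q is uniformly distributed.
    q = q * np.sign(np.diag(r))
    # Flip one axis if needed so that det(q) = +1 (a proper rotation).
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def augment(inputs, targets, rng, max_shift=1.0):
    """Apply the same random rotation and translation to inputs and targets.

    Assumption: both arrays hold 3D positions with shape (num_points, 3),
    so a symmetry transformation acts on them in the same way.
    """
    rot = random_rotation(rng)
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return inputs @ rot.T + shift, targets @ rot.T + shift

# Toy data (made up for illustration): 5 points and slightly displaced targets.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))
targets = inputs + 0.1 * rng.normal(size=(5, 3))
aug_inputs, aug_targets = augment(inputs, targets, rng)
```

Because the same transformation is applied to inputs and targets, the displacement lengths between them are unchanged. A model trained on such samples has to learn the symmetry from data, whereas an equivariant architecture has it built in.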
Abstract
Given large datasets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.
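As an illustration of what "scaling with compute follows a power law" means in practice, the sketch below fits a relation of the form L(C) = a * C^(-b) by linear regression in log-log space. The functional form without an offset term, the helper name `fit_power_law`, and all numbers are assumptions chosen for the example; the data is synthetic and does not reproduce any result from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic data (made up): compute budgets and noisy losses on a known power law.
rng = np.random.default_rng(0)
compute = np.logspace(16, 19, 7)
true_a, true_b = 50.0, 0.2
noise = np.exp(rng.normal(0.0, 0.01, size=compute.shape))
loss = true_a * compute ** (-true_b) * noise

a_hat, b_hat = fit_power_law(compute, loss)
print(f"fitted prefactor a = {a_hat:.3g}, fitted exponent b = {b_hat:.3f}")
```

Fitting one such curve per architecture would let the prefactors and exponents be compared directly. On a log-log plot, a model that is better at every budget shows up as a line that lies below the other across the tested range.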
