Does equivariance matter at scale?

Johann Brehmer, Sönke Behrends, Pim de Haan, Taco Cohen

TL;DR

The paper investigates whether embedding known symmetries via E(3)-equivariant architectures yields advantages at scale compared to learning symmetries from data. Using a rigid-body interaction benchmark and two architectures (baseline Transformer vs. GATr), it demonstrates power-law compute scaling and a persistent equivariant advantage across budgets, while showing that data augmentation can largely close the data-efficiency gap. It also reveals distinct compute-allocation strategies between the two models, indicating that symmetry-aware designs influence how compute should be distributed between model size and training steps. The findings suggest that strong inductive biases can aid performance not only in low-data regimes but also under large data and compute, though practical gains depend on efficient implementations. Overall, the work highlights the continued relevance of incorporating symmetry principles in scalable neural modeling and motivates further efficiency-focused development of equivariant architectures.

Abstract

Given large datasets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.
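
As a rough illustration of the power-law claim above, the following sketch fits loss ≈ a · C^(-b) by linear regression in log-log space. It assumes a pure power law in compute alone; the exact functional form used in the paper is not given in this section, and the function name, variable names, and numbers below are hypothetical, synthetic values rather than results from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log-log space.

    Assumption: a pure power law without an irreducible-loss offset.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope

# Synthetic, illustrative data only (not measurements from the paper).
compute = np.array([1e16, 1e17, 1e18, 1e19])
loss = 2.0 * compute ** -0.1

a, b = fit_power_law(compute, loss)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

Under this simplified form, two architectures whose fitted exponents are similar but whose prefactors differ would show a roughly constant loss ratio across compute budgets, which is the kind of pattern the abstract describes.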

Paper Structure

This paper contains 45 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Scaling with compute. The dots show the training compute budget and test loss in our experiments, the lines indicate the compute-optimal performance according to the scaling laws we find, and the error bands estimate the uncertainty on the power-law coefficients. The test losses of both non-equivariant and equivariant transformers scale as a power law with compute, and the equivariant model outperforms the non-equivariant model by a similar factor at all tested compute budgets.
  • Figure 2: Scaling with training data. We show the performance of the non-equivariant transformer, the non-equivariant transformer trained with data augmentation, and the equivariant transformer as a function of the number of unique tokens in the training dataset. All experiments use the same training compute budget, which means that the number of epochs decreases from left to right. Equivariance improves data efficiency compared to the baseline, but data augmentation can close this gap.
  • Figure 3: Test loss (dotted circles) and scaling-law predictions (background color) as a function of model size and training tokens. Left: non-equivariant transformer. Right: equivariant transformer. In both cases, we observe good agreement between model performance and the scaling-law fit.
  • Figure 4: Model performance at different training compute budgets (panels) as a function of the model size. We show our experiments (dots) and the predictions of our scaling-law fit (lines). The scaling-law fit describes the measurements well.
  • Figure 5: Optimal parameter allocation. We show the compute-optimal model size as a function of the training compute budget for the equivariant transformer and the non-equivariant transformer. The equivariant architecture requires smaller models to achieve compute-optimal performance, but this gap closes for larger compute budgets.
  • ...and 1 more figures
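
The data augmentation mentioned in the Figure 2 caption can be sketched as applying a random rigid transformation to each training sample. The snippet below is a minimal illustration that assumes the inputs are an (N, 3) array of positions and that augmentation uses random rotations and translations only; the paper's actual augmentation scheme is not described in this section, and the names random_rotation and augment are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # remove the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:     # flip one column to obtain a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def augment(points, rng):
    """Apply one random rotation and one random translation to an (N, 3) array."""
    rotation = random_rotation(rng)
    shift = rng.normal(size=3)   # assumed translation scale, chosen for illustration
    return points @ rotation.T + shift

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))  # synthetic positions, not data from the paper
augmented = augment(points, rng)
```

Because a rigid transformation preserves pairwise distances, a non-equivariant model trained on such samples sees many equivalent views of the same physical configuration, whereas an equivariant model handles them identically by construction.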