EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Yi-Lun Liao, Brandon Wood, Abhishek Das, Tess Smidt

TL;DR

EquiformerV2 tackles the scalability barrier of equivariant Transformers for 3D atomistic data by replacing expensive $SO(3)$ tensor products with efficient eSCN convolutions and introducing three architectural refinements: attention re-normalization, separable $S^2$ activation, and separable layer normalization. The approach enables higher-degree representations ($L_{max}$ up to 6–8) and yields state-of-the-art results on the OC20 and OC22 benchmarks, with lower force and energy errors and better speed-accuracy trade-offs. It also demonstrates data-efficiency gains on OC22 and competitive adsorption-energy performance within AdsorbML, while analyses on QM9 and OC20 S2EF-2M clarify when higher degrees help, namely on large, diverse datasets. Overall, EquiformerV2 advances scalable, data-efficient equivariant Transformers for practical materials-science tasks, albeit with dataset-dependent gains and memory considerations at the highest degrees.

Abstract

Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace $SO(3)$ convolutions with eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements -- attention re-normalization, separable $S^2$ activation, and separable layer normalization. Putting this all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on the large-scale OC20 dataset by up to $9\%$ on forces and $4\%$ on energies, offers better speed-accuracy trade-offs, and achieves a $2\times$ reduction in the DFT calculations needed for computing adsorption energies. Additionally, EquiformerV2 trained on only the OC22 dataset outperforms GemNet-OC trained on both the OC20 and OC22 datasets, achieving much better data efficiency. Finally, we compare EquiformerV2 with Equiformer on the QM9 and OC20 S2EF-2M datasets to better understand the performance gain brought by higher degrees.

Paper Structure (62 sections, 8 equations, 6 figures, 14 tables)


Figures (6)

  • Figure 1: Overview of EquiformerV2. We highlight the differences from Equiformer in red. For (b), (c), and (d), the left figure is the original module in Equiformer, and the right figure is the revised module in EquiformerV2. Input 3D graphs are embedded with atom and edge-degree embeddings and processed with Transformer blocks, which consist of equivariant graph attention and feed forward networks. "$\otimes$" denotes multiplication, "$\oplus$" denotes addition, and $\sum$ within a circle denotes summation over all neighbors. "DTP" denotes depth-wise tensor products used in Equiformer. Gray cells indicate intermediate irreps features.
  • Figure 2: Illustration of different activation functions. $G$ denotes conversion from vectors to point samples on a sphere, $F$ can typically be a SiLU activation or MLPs, and $G^{-1}$ is the inverse of $G$.
  • Figure 3: Illustration of how statistics are calculated in different normalizations. "std" denotes standard deviation, and "RMS" denotes root mean square.
  • Figure 4: EquiformerV2 achieves better accuracy trade-offs in terms of both inference speed and training cost. All models are trained on the S2EF-2M split and measured on V100 GPUs.
  • Figure 5: Speed-accuracy trade-offs of different models when used in the AdsorbML algorithm.
  • ...and 1 more figures
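
The Figure 3 caption names two statistics, standard deviation ("std") and root mean square ("RMS"). As a reminder of how they differ, here is a minimal sketch using only the Python standard library. The function names and the sample values are hypothetical. Which feature components each statistic is computed over in EquiformerV2 is not stated in this excerpt, so the code shows only the textbook definitions, not the paper's normalization layer.

```python
import math


def std(values):
    """Population standard deviation: spread of the values around their mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def rms(values):
    """Root mean square: magnitude of the values around zero (no mean subtraction)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical feature values, chosen only for illustration.
features = [1.0, 2.0, 3.0, 4.0]
mean = sum(features) / len(features)

print("std:", std(features))
print("rms:", rms(features))

# The two statistics are related by rms^2 = std^2 + mean^2,
# so they coincide only when the values have zero mean.
print("identity holds:", math.isclose(rms(features) ** 2, std(features) ** 2 + mean ** 2))
```

The practical difference is the mean subtraction: std is unchanged by adding a constant offset to the values, while RMS is not.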