Single-Core Superscalar Optimization of Clifford Neural Layers

X. Angelo Huang, Ruben Ciranni, Giovanni Spadaccini, Carla J. López Zurita

TL;DR

This work accelerates CPU inference for Clifford neural layers, which realize $E(n)$ and $O(n)$ equivariance, by exploiting Clifford algebra structure to reduce memory traffic and computation. It moves from a PyTorch-based implementation to a high-performance C backend with inlining, loop optimizations, and AVX2 SIMD, while preserving numerical correctness. The authors report an average speedup of $21.35\times$ over the baseline across eleven functions, with runtimes comparable to or faster than the original PyTorch implementation in six cases and within the same order of magnitude in the rest, alongside a testing and benchmarking setup. The results indicate that high-performance Clifford networks are practical on CPU for physics- and geometry-inspired applications.

Abstract

Within the growing interest in the physical sciences in developing networks with equivariance properties, Clifford neural layers shine as one approach that delivers $E(n)$ and $O(n)$ equivariances given specific group actions. In this paper, we analyze the inner structure of the computation within Clifford convolutional layers and propose and implement several optimizations to speed up the inference process while maintaining correctness. In particular, we begin by analyzing the theoretical foundations of Clifford algebras to eliminate redundant matrix allocations and computations, then systematically apply established optimization techniques to enhance performance further. We report a final average speedup of $21.35\times$ over the baseline implementation of eleven functions, and runtimes comparable to or faster than the original PyTorch implementation in six cases. In the remaining cases, we achieve performance in the same order of magnitude as the original library.

Paper Structure

This paper contains 9 sections, 5 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: A visualization of matrix-vector multiplication in $Cl_{1,0}(\mathbb{R})$ using the Clifford kernel technique employed in the original computation, with many optimization pathways. See section \ref{par:original-computation} for a discussion.
  • Figure 2: Speedup of each optimized function against its corresponding baseline function, without applying vectorization techniques. The corresponding input size parameter is given as $n$ in Tables \ref{tab:parameters_linear}, \ref{tab:parameters_act_linear} and \ref{tab:parameters_conv}.
  • Figure 3: Speedup of each optimized function against its corresponding baseline function, with SIMD instructions. Averaging across all values reported in this plot gives an overall speedup of $21.35\times$ over the baseline implementation.
  • Figure 4: Roofline plots. Flops and bytes transferred were gathered using the perf system call. The input size per function is the same as reported in Figures \ref{fig:scalar_speedup} and \ref{fig:vectorised_speedup}.
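
To make the kernel technique named in Figure 1 concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the standard basis $\{1, e_1\}$ of $Cl_{1,0}(\mathbb{R})$ with $e_1^2 = +1$, and all function names are hypothetical. It compares two ways of multiplying a weight multivector by an input multivector: building the $2 \times 2$ kernel matrix first, and computing the same product directly from the components without allocating the matrix.

```python
# Illustrative sketch only: function names are hypothetical and the layout
# [scalar part, e1 part] is an assumption, not taken from the paper.

def kernel_matrix_cl10(w):
    """Materialize the 2x2 kernel for left-multiplication by w = w0 + w1*e1.

    In Cl_{1,0} we have e1*e1 = +1, so
    (w0 + w1*e1)(x0 + x1*e1) = (w0*x0 + w1*x1) + (w1*x0 + w0*x1)*e1.
    """
    w0, w1 = w
    return [[w0, w1],
            [w1, w0]]


def matvec(m, x):
    """Plain dense matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def geometric_product_cl10(w, x):
    """Same product computed directly, with no intermediate matrix."""
    w0, w1 = w
    x0, x1 = x
    return [w0 * x0 + w1 * x1,
            w0 * x1 + w1 * x0]


w = [2.0, 3.0]  # weight multivector 2 + 3*e1
x = [5.0, 7.0]  # input multivector 5 + 7*e1

via_matrix = matvec(kernel_matrix_cl10(w), x)
direct = geometric_product_cl10(w, x)
print(via_matrix, direct)  # both are [31.0, 29.0]
```

Both paths perform the same arithmetic, but the direct form skips building and storing the kernel. This is one plausible reading of the redundant matrix allocations the abstract refers to.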