Tensor Decomposition for Non-Clifford Gate Minimization

Kirill Khoruzhii, Patrick Gelß, Sebastian Pokutta

TL;DR

Algebraic methods are developed that match or improve all reported results for both Toffoli and $T$-count, with most circuits completing in under a minute on a single CPU instead of the thousands of TPUs used by prior work.

Abstract

Fault-tolerant quantum computation requires minimizing non-Clifford gates, whose implementation via magic state distillation dominates the resource costs. While $T$-count minimization is well-studied, dedicated $CCZ$ factories shift the natural target to direct Toffoli minimization. We develop algebraic methods for this problem, building on a connection between Toffoli count and tensor decomposition over $\mathbb{F}_2$. On standard benchmarks, these methods match or improve all reported results for both Toffoli and $T$-count, with most circuits completing in under a minute on a single CPU instead of the thousands of TPUs used by prior work.

Paper Structure (15 sections, 29 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Non-Clifford gate minimization as tensor decomposition. a) Overview. Up to Clifford gates, any Clifford+$T$ circuit can be reduced to a diagonal phase operator $U_f$. Reducing non-Clifford cost is then recast as a tensor-decomposition problem: a Waring decomposition can be applied to optimize the $T$-count of the full phase polynomial $f(x)$, while a CP decomposition targets only the cubic part $f_3(x)$ to reduce the number of $CCZ$/Toffoli gates (for arithmetic circuits one often has $f = 4 f_3$). In practice, a CP decomposition can serve as a useful initialization for a subsequent Waring decomposition. b) Waring decomposition. A symmetric order-3 signature tensor is written as a sum of $r_w$ rank-1 terms, giving a realization with $r_w$ $T$ gates; the inset shows the circuit pattern for one rank-1 term (compute a parity, apply $T$, uncompute). c) CP decomposition. The cubic tensor is expressed as a sum of $r_{\mathrm{cp}}$ rank-1 outer products, yielding an implementation with $r_{\mathrm{cp}}$ $CCZ$ (equivalently Toffoli) gates acting on parity qubits. Toffoli and $CCZ$ are interconvertible by two Hadamards, as illustrated in the inset.
  • Figure 2: Flip graph symmetries, operations, and structure. a) Local CNOT basis changes applied before a $CCZ$ gate can always be compensated by Clifford gates; algebraically, a cubic rank-1 term $(u,v,w)$ depends only on the 3D subspace $\langle u,v,w\rangle$ up to $\mathrm{GL}(3,2)$ symmetries, visualized as a Fano plane. Any triangle (example highlighted in blue) can represent the same term. b) The flip graph is generated by three local transformations. A reduction decreases rank by collecting all terms over a shared factor and minimizing the induced quadratic form. A flip rewires two terms sharing a common factor $w$ while preserving rank: $u_1 v_1 w + u_2 v_2 w \mapsto (u_1 + u_2) v_1 w + u_2 (v_1 + v_2) w$. A plus combines an inverse reduction with a flip to escape local plateaus, temporarily increasing the rank: $u_1 v_1 w_1 + u_2 v_2 w_2 \mapsto (u_1 + u_2) v_1 w_1 + u_2 (v_1 + v_2) w_2 + u_2 v_1 (w_1 + w_2)$. c) Schematic flip graph structure: vertices are schemes grouped by rank; horizontal edges correspond to flips within a fixed rank, vertical edges to reductions.
  • Figure 3: Performance on the planted CP benchmark. For each size and planted rank $r_{\textnormal{pl}}$, we sample $U,V,W$ uniformly and form $T_{ijk}=\sum_{q=1}^{r_{\textnormal{pl}}} U_{qi} V_{qj} W_{qk}$. We then run the decomposition methods on $T$ and report the recovered rank, averaged over 4 independently generated tensors for each parameter pair. a) Heatmaps show the recovered rank (logarithmic color scale) returned by BCO, SGE, FGS, and the combined pipeline BCO+SGE+FGS. The white contour separates the region where the recovered rank exceeds the planted upper bound by more than one. b) Greedy vs. beam-search variants for BCO and SGE, using beam width $2^{10}$. c) Rank improvement from FGS after BCO+SGE. Dashed lines indicate equality.
  • Figure 4: Bilinear circuit optimization pipeline. The example shows a $\mathrm{GF}(2^2)$ multiplication circuit. (a) Hadamard conjugation on the output register converts Toffoli gates to $CCZ$ gates; adjacent $H$ gates cancel. (b) Low-rank CP decomposition of the bilinear tensor reduces the $CCZ$ count from 5 to 3. (c) Inverse Hadamard conjugation restores Toffoli form.
  • Figure 5: Waring decomposition performance on the planted CP benchmark. For the same planted CP tensors $T$ as in Fig. 3, we seek a minimal Waring decomposition of the signature tensor $S = \mathrm{Alt}(T)$. We compare three approaches: greedy selection from the original FastTODD [vandaele_2025a], beam search with width $2^{10}$, and greedy selection with CP initialization. Results show the recovered Waring rank $r_{\textnormal{w}}$ averaged over 4 independently generated tensors.
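
The flip and plus identities in the Figure 2 caption, and the planted tensor construction in the Figure 3 caption, can be checked numerically with a small sketch. This is not the authors' implementation. The function names (`cp_tensor`, `planted_terms`, `flip`, `plus`) and the encoding of vectors over $\mathbb{F}_2$ as integer bitmasks are assumptions made here for illustration. A tensor is stored as the set of index triples $(i,j,k)$ with $T_{ijk}=1$, so addition over $\mathbb{F}_2$ becomes symmetric difference.

```python
import random
from itertools import product


def bits(x, n):
    """Indices i < n at which the bitmask x has a 1."""
    return [i for i in range(n) if (x >> i) & 1]


def cp_tensor(terms, n):
    """Support of T_ijk = sum_q u_q[i] v_q[j] w_q[k] over F_2.

    `terms` is a list of (u, v, w) bitmask triples (hypothetical encoding).
    """
    support = set()
    for u, v, w in terms:
        for ijk in product(bits(u, n), bits(v, n), bits(w, n)):
            support ^= {ijk}  # adding 1 mod 2 toggles membership
    return frozenset(support)


def planted_terms(n, r_pl, rng):
    """Sample r_pl rank-1 terms with uniformly random factors in F_2^n."""
    return [tuple(rng.getrandbits(n) for _ in range(3)) for _ in range(r_pl)]


def flip(terms, a, b):
    """u1 v1 w + u2 v2 w -> (u1+u2) v1 w + u2 (v1+v2) w; rank is unchanged."""
    (u1, v1, w1), (u2, v2, w2) = terms[a], terms[b]
    if a == b or w1 != w2:
        raise ValueError("flip needs two distinct terms sharing the factor w")
    out = list(terms)
    out[a] = (u1 ^ u2, v1, w1)
    out[b] = (u2, v1 ^ v2, w1)
    return out


def plus(terms, a, b):
    """Replace two terms by three equivalent ones; rank grows by one."""
    if a == b:
        raise ValueError("plus needs two distinct terms")
    (u1, v1, w1), (u2, v2, w2) = terms[a], terms[b]
    out = list(terms)
    out[a] = (u1 ^ u2, v1, w1)
    out[b] = (u2, v1 ^ v2, w2)
    out.append((u2, v1, w1 ^ w2))
    return out


rng = random.Random(0)
n, r_pl = 5, 4
terms = planted_terms(n, r_pl, rng)
# Force terms 0 and 1 to share their third factor so that a flip applies.
terms[1] = (terms[1][0], terms[1][1], terms[0][2])
T = cp_tensor(terms, n)

flipped = flip(terms, 0, 1)
plussed = plus(terms, 2, 3)
assert cp_tensor(flipped, n) == T and len(flipped) == len(terms)
assert cp_tensor(plussed, n) == T and len(plussed) == len(terms) + 1
```

Both moves leave the represented tensor unchanged because the cross terms they introduce appear twice and cancel modulo 2. Only the number of rank-1 terms, and hence the $CCZ$ count of the corresponding circuit, differs between schemes.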