Fast computation of temperature and polarization coupling matrices

Georgia Kiddier, Steven Gratton

Abstract

We present a fast and exact method for computing CMB mode-coupling matrices based on an optimised evaluation of Wigner-3j symbols. The method exploits analytic structure in the Wigner-3j symbol configurations that appear in temperature and polarization coupling matrices, expressing all required quantities in terms of a small set of recurrence-generated values that are precomputed and stored in lookup tables. This approach reduces the computational cost of constructing the full coupling matrices whilst maintaining numerical accuracy. We demonstrate the performance of the threej_cosmo implementation using realistic survey masks from current CMB experiments. Relative to the standard recursion-based approaches used in existing pseudo-C_l pipelines, the method achieves speedups of 6-25x in practical coupling-matrix constructions, with the largest gains occurring at high multipoles. The algorithm admits efficient parallelisation on both CPUs and GPUs, with GPUs providing additional acceleration by up to a further factor of order 50 on modern hardware, without altering the underlying formalism. Beyond full matrix construction, the approach is naturally suited to applications in which only a restricted set of l3 modes is required for each (l1,l2) pair, such as the computation of band-limited coupling matrices and analytic covariance terms. These features make threej_cosmo a practical backend for pseudo-C_l estimation and related calculations in next-generation CMB analysis pipelines.
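To make the object being computed concrete, the following is a minimal sketch of the temperature coupling matrix in its textbook pseudo-C_l form, $K^{TT}_{\ell_1\ell_2} = \frac{2\ell_2+1}{4\pi}\sum_{\ell_3}(2\ell_3+1)\,W_{\ell_3}\begin{pmatrix}\ell_1 & \ell_2 & \ell_3\\ 0 & 0 & 0\end{pmatrix}^2$, where $W_\ell$ is the power spectrum of the mask. It is not the threej_cosmo algorithm: the abstract describes recurrence-generated lookup tables, whereas this sketch evaluates the standard closed form of the zero-$m$ 3j symbol with log-gamma functions. The function names `wigner3j_000_sq` and `coupling_tt` are hypothetical and chosen only for illustration.

```python
import math

def wigner3j_000_sq(l1, l2, l3):
    """Square of the Wigner-3j symbol (l1 l2 l3; 0 0 0), closed form."""
    L = l1 + l2 + l3
    # The symbol vanishes for odd L or if the triangle inequality fails.
    if L % 2 or l3 < abs(l1 - l2) or l3 > l1 + l2:
        return 0.0
    g = L // 2
    lg = math.lgamma  # lgamma(n + 1) = log(n!)
    log_sq = (lg(L - 2 * l1 + 1) + lg(L - 2 * l2 + 1) + lg(L - 2 * l3 + 1)
              - lg(L + 2)
              + 2.0 * (lg(g + 1) - lg(g - l1 + 1)
                       - lg(g - l2 + 1) - lg(g - l3 + 1)))
    return math.exp(log_sq)

def coupling_tt(w, lmax):
    """Naive O(lmax^3) temperature coupling matrix (illustrative only).

    w[l] is the mask power spectrum; entries beyond len(w) - 1 are
    treated as zero.
    """
    k = [[0.0] * (lmax + 1) for _ in range(lmax + 1)]
    for l1 in range(lmax + 1):
        for l2 in range(lmax + 1):
            s = 0.0
            # Only l3 with l1 + l2 + l3 even contribute, so step by 2.
            top = min(l1 + l2, len(w) - 1)
            for l3 in range(abs(l1 - l2), top + 1, 2):
                s += (2 * l3 + 1) * w[l3] * wigner3j_000_sq(l1, l2, l3)
            k[l1][l2] = (2 * l2 + 1) / (4.0 * math.pi) * s
    return k

# Sanity check: a full-sky mask has W_l = 4*pi for l = 0 and zero otherwise,
# in which case the coupling matrix reduces to the identity.
lmax = 6
w_full_sky = [4.0 * math.pi] + [0.0] * (2 * lmax)
k_tt = coupling_tt(w_full_sky, lmax)
```

The triple loop makes the cubic cost in $\ell_{\max}$ explicit; the speedups quoted above come from evaluating the 3j factors more cheaply, not from changing this sum.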

Paper Structure (17 sections, 46 equations, 2 figures)

Figures (2)

  • Figure 1: CPU time for the calculation of the mode-coupling matrices $K^{EE}$ (left) and $K^{TT}$ (right) as a function of $\ell_{\max}$, comparing the threej_cosmo implementation with the reference Schulten--Gordon (S-G) algorithm. Tests use the ACT DR6 mask and were run on an 8-core Apple M3 CPU. Error bars show the standard deviation over 5 runs per $\ell_{\max}$. For TT, threej_cosmo achieves speedups of $\sim 20$–$25\times$ for $\ell_{\max}\gtrsim 2000$ (e.g. $25.3\times$ at $\ell_{\max}=4722$), while for EE the speedup is typically $\sim 6$–$7\times$ (e.g. $6.7\times$ at $\ell_{\max}=4722$).
  • Figure 2: Kernel execution time as a function of the maximum multipole $\ell_{\max}$ for the GPU-accelerated threej_cosmo code computing the $K^{TT}$ and $K^{EE}$ coupling matrices. Benchmarks were performed on an NVIDIA A100 GPU (40 GB) using OpenMP target offloading with linearized triangular iteration for full parallelization. Times shown are kernel execution only, excluding data transfer overhead (${\sim}0.4$ s per invocation). Error bars show the standard deviation over 5 runs. Top: linear scale; at $\ell_{\max} = 10^4$ the $K^{TT}$ computation completes in 0.63 s and $K^{EE}$ in 3.2 s. Bottom: log-log scale with an $\ell_{\max}^3$ reference line (dashed), confirming the expected cubic scaling.
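
The "linearized triangular iteration" mentioned in the Figure 2 caption can be illustrated with a short sketch. Because the sum over $\ell_3$ in a coupling-matrix element is symmetric in $(\ell_1, \ell_2)$, only the pairs with $\ell_2 \le \ell_1$ need to be visited, and flattening that triangle into a single index gives every parallel worker one independent unit of work. The exact index mapping used by threej_cosmo is not given in this summary, so the row-major mapping $k = \ell_1(\ell_1+1)/2 + \ell_2$ below is an assumption, and `pair_from_index` and `n_pairs` are hypothetical names.

```python
import math

def n_pairs(lmax):
    """Number of (l1, l2) pairs with 0 <= l2 <= l1 <= lmax."""
    return (lmax + 1) * (lmax + 2) // 2

def pair_from_index(k):
    """Invert the assumed mapping k = l1 * (l1 + 1) // 2 + l2.

    math.isqrt keeps the inversion exact in integer arithmetic, which
    avoids floating-point rounding for large flat indices.
    """
    l1 = (math.isqrt(8 * k + 1) - 1) // 2
    l2 = k - l1 * (l1 + 1) // 2
    return l1, l2

# A single flat loop replaces the nested triangular loop; each k could be
# handed to a separate thread without any dependence on the others.
lmax = 3
pairs = [pair_from_index(k) for k in range(n_pairs(lmax))]
```

A flat index of this kind spreads the work evenly over the triangle, whereas parallelising only the outer $\ell_1$ loop would give each worker a different number of $\ell_2$ values.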