A Scalable Diagonalization Framework for Tensor-Product Bitstring Selected Configuration Interaction

Enhua Xu, William Dawson, Himadri Pathak, Takahito Nakajima

TL;DR

A fully distributed diagonalization framework for extremely large selected determinant spaces that removes the memory bottleneck caused by replicating the CI vector across processes, establishing TBSCI as a scalable SCI methodology.

Abstract

Selected configuration interaction (SCI) methods are effective for treating strongly correlated electronic systems, yet their scalability has long been limited by implementations that replicate the configuration interaction (CI) vector across processes, leading to severe memory bottlenecks. Here, we present a fully distributed diagonalization framework tailored for extremely large selected determinant spaces, directly addressing this major scalability bottleneck of modern SCI methods. The method is grounded in a tensor-product bitstring (TPB) representation, in which determinants are organized through a TPB structure constructed from selected alpha- and beta-bitstrings, and is referred to as tensor-product bitstring SCI (TBSCI). An efficient TBSCI eigensolver is developed based on a novel bitstring-based Hamiltonian evaluation algorithm together with a suite of MPI communication strategies designed to improve parallel efficiency. Large-scale full configuration interaction (FCI) benchmarks, employed as communication-intensive stress tests, demonstrate that the implemented TBSCI eigensolver continues to reduce the wall time for distributed diagonalization of 2.6 trillion determinants when scaled up to 54,000 nodes (more than 2.5 million cores) on the supercomputer Fugaku. Beyond scalability, we investigate the structural compactness of the TPB representation and show that selecting alpha- and beta-bitstrings according to their collective weights in a reference SCI wavefunction yields TPB-based wavefunctions approaching the FCI limit while using only a small fraction of determinants. These results establish TBSCI as a scalable SCI methodology and provide evidence for the intrinsic compactness of the TPB representation.
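
The TPB construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the "collective weight" of a bitstring is the sum of squared CI coefficients over all reference determinants containing it, and that the threshold `delta` is applied relative to the largest weight. Both definitions, and the names `collective_weights`, `select_bitstrings`, and `tpb_space`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def collective_weights(ref_wavefunction):
    """Accumulate a weight per alpha- and beta-bitstring.

    ref_wavefunction maps (alpha_bits, beta_bits) -> CI coefficient.
    ASSUMPTION: weight = sum of c**2 over determinants sharing the bitstring.
    """
    w_alpha = defaultdict(float)
    w_beta = defaultdict(float)
    for (alpha, beta), c in ref_wavefunction.items():
        w_alpha[alpha] += c * c
        w_beta[beta] += c * c
    return dict(w_alpha), dict(w_beta)


def select_bitstrings(weights, delta):
    """Keep bitstrings whose weight is at least delta times the largest weight.

    ASSUMPTION: the threshold is relative to the maximum weight.
    """
    w_max = max(weights.values())
    return sorted(s for s, w in weights.items() if w >= delta * w_max)


def tpb_space(alphas, betas):
    """Tensor product: every selected alpha paired with every selected beta."""
    return list(product(alphas, betas))


# Toy reference wavefunction: bitstrings stored as integers (occupation masks).
ref = {
    (0b0011, 0b0011): 0.9,
    (0b0101, 0b0011): 0.3,
    (0b0011, 0b0110): 0.3,
    (0b1001, 0b1010): 0.1,
}
w_alpha, w_beta = collective_weights(ref)
alphas = select_bitstrings(w_alpha, delta=0.05)
betas = select_bitstrings(w_beta, delta=0.05)
space = tpb_space(alphas, betas)
print(len(space), "determinants from", len(alphas), "alpha x", len(betas), "beta")
```

In this toy case the product space contains the determinant `(0b0101, 0b0110)`, which does not appear in the reference wavefunction: the tensor-product structure adds every pairing of the selected bitstrings, not only the pairings already present in the reference.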

Paper Structure

This paper contains 19 sections, 7 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Distribution of all determinants from $\mathcal{D}_\mathrm{TPB}$ into segments across $N$ processes (in this work each node runs exactly one process). Each process holds a strict subset of $m_p$ $\alpha$-bitstrings. The upper and lower rows of $\alpha$-bitstring labels indicate their global and local indices within each process, respectively. Here we define a displacement index $d_p = \sum_{i=0}^{p-1} m_i$ to indicate the offset of $\alpha$-bitstrings assigned to each process. On a local process, the determinants are divided into segments (indicated by vertical arrows), where each segment contains all determinants sharing the same $\alpha$-bitstring.
  • Figure 2: Wall time (in seconds) for a single distributed matrix-vector multiplication across four benchmark systems. Open squares and filled squares denote the average and maximum wall times across nodes, respectively. Small boxed labels indicate $T_{\text{delay}}$ on the slowest process (in seconds). We plot the ideal scaling curve based on the initial average wall-time data point to quantify the overhead of our implementation.
  • Figure 3: Distributions of CI coefficient ratios $|c_K/c_{\mathrm{HF}}|$ within the TBSCI determinant spaces constructed from SCI-derived bitstrings. Bars report the number of determinants in each logarithmic bin. Different colors correspond to relative bitstring-weight thresholds $\delta$ used to select $\alpha$- and $\beta$-bitstrings, while the reference bars denote the FCI distribution under identical symmetry and frozen-core conditions.
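
The index bookkeeping in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the paper's code. The displacement formula $d_p = \sum_{i=0}^{p-1} m_i$ is taken from the caption, while the function names and the assumption that every segment has exactly `n_beta` entries (one per selected $\beta$-bitstring, stored contiguously) are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def displacements(m):
    """Return d_p = sum_{i<p} m_i for each process p (exclusive prefix sum)."""
    return [0] + list(accumulate(m))[:-1]


def locate(global_index, m, d):
    """Map a global alpha-bitstring index to (process p, local index)."""
    if not 0 <= global_index < sum(m):
        raise IndexError("global alpha-bitstring index out of range")
    # Last process whose displacement does not exceed the global index.
    p = bisect_right(d, global_index) - 1
    return p, global_index - d[p]


def segment_offset(local_alpha, n_beta):
    """Start of a segment in the process-local coefficient array.

    ASSUMPTION: each segment holds n_beta determinants, stored contiguously.
    """
    return local_alpha * n_beta


# Three processes holding 3, 2, and 4 alpha-bitstrings, respectively.
m = [3, 2, 4]
d = displacements(m)
p, local = locate(4, m, d)
print(d, p, local, segment_offset(local, n_beta=10))
```

With this layout, a process needs only the list of $m_p$ values to work out which process owns any global $\alpha$-bitstring index, without a lookup table over all bitstrings.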