SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries

Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas, Justin Singh Kang, Amirali Aghazadeh

TL;DR

SHAP zero introduces a theoretical and algorithmic framework that combines sparse Fourier sketching with a $q$-ary Möbius transform to amortize SHAP-based explanations for biological sequence models. After a one-time sketching cost, the method enables near-zero marginal cost for explaining future queries, while preserving high-order interaction discovery through Faith-Shap. The approach yields substantial runtime reductions (up to 1000x in some cases) and recovers biologically meaningful motifs across TIGER, inDelphi, and Tranception, with strong agreement to KernelSHAP. This work broadens principled interpretability to large-scale sequence applications and highlights interdisciplinary connections between signal processing, coding theory, and algebraic geometry in ML explainability.

Abstract

The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences--incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
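The link between Shapley values and feature interactions mentioned in the abstract can be illustrated in the classical binary (set-function) setting. The sketch below is not the paper's $q$-ary construction. It assumes a hypothetical toy value function `v` over three features and uses the standard identity that each Möbius coefficient $m(S)$ is split equally among the $|S|$ features in $S$. The brute-force routine also shows why per-query cost explodes: it enumerates every ordering of the features.

```python
from itertools import combinations, permutations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def shapley_bruteforce(v, n):
    """Exact Shapley values: average marginal contribution over all n! orderings."""
    phi = [0.0] * n
    for order in permutations(range(n)):
        seen = frozenset()
        for i in order:
            phi[i] += v(seen | {i}) - v(seen)
            seen = seen | {i}
    return [p / factorial(n) for p in phi]


def mobius_transform(v, n):
    """Classical Mobius transform: m(S) = sum over T subset of S of (-1)^(|S|-|T|) v(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(range(n))
    }


def shapley_from_mobius(m, n):
    """Split each interaction m(S) equally among the |S| features it involves."""
    return [sum(val / len(S) for S, val in m.items() if i in S) for i in range(n)]


def v(S):
    """Hypothetical toy model: two main effects, one pairwise and one third-order term."""
    return (2.0 * (0 in S) + 1.0 * (1 in S)
            + 3.0 * (0 in S and 2 in S)
            - 1.5 * (0 in S and 1 in S and 2 in S))


n = 3
m = mobius_transform(v, n)
phi_brute = shapley_bruteforce(v, n)
phi_mobius = shapley_from_mobius(m, n)
print(phi_brute, phi_mobius)
```

Once the Möbius coefficients are known, the Shapley values follow from a single pass over the non-zero interactions, which is the intuition behind paying the expensive step once and reusing it.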

Paper Structure

This paper contains 33 sections, 6 theorems, 60 equations, 10 figures, 10 tables.

Key Result

Proposition 3.2

(Fourier transform to $q$-ary Möbius transform). Given the top-$s$ Fourier coefficients $F[\mathbf{y}]$ with a maximum order of $\ell$ and the input query sequence $\mathbf{x}_i$, the Möbius transform $M_{\mathbf{x}_i}[\mathbf{k}]$ is defined by Equation (eq:mobius_to_fourier), whose computational complexity scales as $\mathcal{O}(s^2 (2q)^{\ell})$.
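To get a feel for the bound in Proposition 3.2, the short sketch below evaluates $s^2 (2q)^{\ell}$ for a few settings. The function name `mobius_cost_bound` and the value $s=100$ are illustrative assumptions, not figures from the paper, and big-O constants are ignored, so the numbers indicate growth rates only.

```python
def mobius_cost_bound(s, q, ell):
    """Evaluate s^2 * (2q)^ell, the scaling term in Proposition 3.2 (constants ignored)."""
    return s ** 2 * (2 * q) ** ell


s = 100  # hypothetical sparsity level, chosen only for illustration
alphabets = {"DNA (q=4)": 4, "protein (q=20)": 20}

# Tabulate the bound for each alphabet and interaction order ell = 1..4.
bounds = {
    (name, ell): mobius_cost_bound(s, q, ell)
    for name, q in alphabets.items()
    for ell in range(1, 5)
}

for (name, ell), cost in bounds.items():
    print(f"{name:15s} ell={ell}: {cost:,}")
```

The expression does not contain the sequence length $n$: it grows quadratically in the number of retained Fourier coefficients $s$ and exponentially only in the maximum interaction order $\ell$.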

Figures (10)

  • Figure 1: Overview of SHAP zero. a, SHAP zero pays a one-time cost to create a global Fourier sketch of $f$. This illustration shows $s=4$ Fourier coefficients strategically aliased into multiple subsampled transforms ($U_1$, $U_2$), and recovered by identifying singleton bins. b, For each future query $\mathbf{x}_1, \ldots, \mathbf{x}_Q$, SHAP zero localizes the global sketch via the Möbius transform of order $\ell$, capturing query-specific feature interactions. This maps to c, SHAP values, and d, Shapley interactions. By marginalizing the cost of future queries, SHAP zero enables e, scalable amortized explanations and f, discovery of biological motifs at unprecedented scale.
  • Figure 2: SHAP zero enables scalable amortized explanations in TIGER. a, TIGER (wessels2024prediction) predicts the guide score of target sequences of length $n=26$. b, The top Fourier coefficients estimated by SHAP zero outperform linear and pairwise models in predicting the guide scores in a held-out set. c, SHAP value estimates show high agreement ($\rho=0.83$) between SHAP zero and KernelSHAP. d, Total runtime of SHAP zero against KernelSHAP is marked by $\times$ in plots that depict the computational cost versus the number of explained sequences in both algorithms. e, Histogram of Faith-Shap interactions from SHAP zero compared to SHAP-IQ (see Appendix \ref{appendix:results}). f, Total runtime of SHAP zero versus SHAP-IQ in TIGER demonstrates that SHAP zero is more than 1000-fold faster.
  • Figure 3: SHAP zero reveals high-order motifs in inDelphi. a, inDelphi (shen2018predictable) predicts DNA repair outcomes in sequences of length $n=40$. b, Recovered Fourier coefficients outperform linear and pairwise models in predicting repair outcomes in a held-out set. AUROC and AUPRC are not reported due to the regression nature of the model (see Appendix \ref{appendix:indelphi}). c, SHAP zero and KernelSHAP estimates reveal the importance of nucleotides around the cut site. d, Total runtime of SHAP zero versus KernelSHAP. e, Histogram of Faith-Shap interactions in SHAP zero compared to SHAP-IQ (see Appendix \ref{appendix:results}). High-order feature interactions identified by SHAP zero reveal the importance of microhomology patterns around the cut site (see Appendix \ref{appendix:results}). f, Total runtime of SHAP zero versus SHAP-IQ.
  • Figure 4: SHAP zero uncovers epistatic interactions in Tranception. a, We analyze the green fluorescent protein over $n=10$ epistatic sites from sarkisyan2016local. b, SHAP zero and KernelSHAP estimates ($\rho=0.97$) reveal the importance of secondary structure promoters. c, Total runtime of SHAP zero versus KernelSHAP. d, Heatmap of Faith-Shap interactions run over 200 sequences with SHAP zero and 50 sequences with SHAP-IQ. SHAP zero reveals numerous epistatic interactions with Proline (P) and Lysine (K). e, Total runtime of SHAP zero versus SHAP-IQ.
  • Figure 5: Top interactions in TIGER. a, Top 80 Faith-Shap interactions in TIGER with SHAP zero and b, SHAP-IQ. Although overlapping interactions in SHAP zero and SHAP-IQ are in agreement, SHAP zero interactions are more concentrated around the seed region.
  • ...and 5 more figures
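
The aliasing described in the Figure 1 caption can be reproduced on a tiny example. The sketch below is a simplified illustration, not the paper's sketching algorithm: it assumes a hypothetical function on $\{0,1,2\}^3$ with $s=4$ non-zero Fourier coefficients, subsamples it by zeroing all but two coordinates, and shows that each bin of the smaller transform holds the sum of the coefficients that alias into it. The helper names (`synthesize`, `qary_dft`, `subsampled_transform`) are made up for this example, and the step that detects which bins are singletons is not shown.

```python
import cmath
from itertools import product


def synthesize(coeffs, q, n):
    """Build f(x) = sum_y F[y] * omega^<x, y> on {0..q-1}^n from sparse coefficients."""
    omega = cmath.exp(2j * cmath.pi / q)
    return {
        x: sum(c * omega ** (sum(a * b for a, b in zip(x, y)) % q)
               for y, c in coeffs.items())
        for x in product(range(q), repeat=n)
    }


def qary_dft(values, q, n):
    """Normalized q-ary Fourier transform of a function stored as a dict over {0..q-1}^n."""
    omega = cmath.exp(2j * cmath.pi / q)
    points = list(product(range(q), repeat=n))
    return {
        y: sum(values[x] * omega ** (-sum(a * b for a, b in zip(x, y)) % q)
               for x in points) / q ** n
        for y in points
    }


def subsampled_transform(f, q, n, keep):
    """Transform f restricted to inputs that are zero outside the `keep` coordinates."""
    sub = {}
    for xs in product(range(q), repeat=len(keep)):
        full = [0] * n
        for pos, val in zip(keep, xs):
            full[pos] = val
        sub[xs] = f[tuple(full)]
    return qary_dft(sub, q, len(keep))


q, n = 3, 3
# Hypothetical sparse spectrum with s = 4 non-zero coefficients.
coeffs = {(1, 0, 2): 2.0, (1, 0, 1): -1.0, (0, 2, 1): 0.5, (2, 1, 0): 1.5}
f = synthesize(coeffs, q, n)

U1 = subsampled_transform(f, q, n, keep=(0, 1))  # bins indexed by (y_0, y_1)
U2 = subsampled_transform(f, q, n, keep=(1, 2))  # bins indexed by (y_1, y_2)

print("U1 bin (1, 0):", round(U1[(1, 0)].real, 6))  # two coefficients collide here
print("U2 bin (0, 2):", round(U2[(0, 2)].real, 6))  # a single coefficient lands here
```

In `U1` the coefficients at `(1, 0, 2)` and `(1, 0, 1)` fall into the same bin and only their sum is visible, while in `U2` they land in separate bins. Using several subsampling patterns is what lets colliding coefficients be separated.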

Theorems & Definitions (10)

  • Definition 2.1
  • Definition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof