SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries
Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas, Justin Singh Kang, Amirali Aghazadeh
TL;DR
SHAP zero introduces a theoretical and algorithmic framework that combines sparse Fourier sketching with a $q$-ary Möbius transform to amortize SHAP-based explanations for biological sequence models. After a one-time sketching cost, the method explains future queries at near-zero marginal cost while preserving high-order interaction discovery through Faith-Shap. The approach yields substantial runtime reductions (up to 1000x in some cases) and recovers biologically meaningful motifs across TIGER, inDelphi, and Tranception, in strong agreement with KernelSHAP. This work broadens principled interpretability to large-scale sequence applications and highlights interdisciplinary connections between signal processing, coding theory, and algebraic geometry in ML explainability.
Abstract
The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences, each incurring an exponential computational cost. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
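To make the link between Shapley values and feature interactions concrete, the following is a minimal sketch on a toy binary set function. It is not the SHAP zero algorithm: it uses the classical binary Möbius transform rather than the $q$-ary transform or sparse Fourier sketching, and every name in it (`toy_model`, `shapley_exact`, `mobius_transform`, `shapley_from_mobius`) is hypothetical. It only illustrates the standard identity that each Möbius (interaction) coefficient $m(S)$ is shared equally among the features in $S$, so that once a sparse set of coefficients is known, Shapley values follow from a cheap sum rather than an exponential enumeration of coalitions.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def shapley_exact(f, n):
    """Exact Shapley values: enumerates 2^(n-1) coalitions per feature."""
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for S in subsets(others):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (f(S | {i}) - f(S))
        phi.append(total)
    return phi


def mobius_transform(f, n):
    """Binary Mobius transform: m(S) = sum over T subset of S of (-1)^(|S|-|T|) f(T).

    Only nonzero coefficients are kept, so the result is sparse when the
    model has few interactions.
    """
    m = {}
    for S in subsets(range(n)):
        val = sum((-1) ** (len(S) - len(T)) * f(T) for T in subsets(S))
        if abs(val) > 1e-12:
            m[S] = val
    return m


def shapley_from_mobius(m, n):
    """Shapley values from Mobius coefficients: phi_i = sum over S containing i of m(S)/|S|."""
    phi = [0.0] * n
    for S, coef in m.items():
        for i in S:
            phi[i] += coef / len(S)
    return phi


def toy_model(S):
    """Hypothetical 4-feature model: 2*x0 + x1 + 3*x0*x2 (feature 3 is unused)."""
    x = [1 if i in S else 0 for i in range(4)]
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


m = mobius_transform(toy_model, 4)
print(shapley_from_mobius(m, 4))  # [3.5, 1.0, 1.5, 0.0]
```

In this toy case the transform has only three nonzero coefficients, and the pairwise term $3 x_0 x_2$ is split evenly between features 0 and 2. The transform here is computed by brute force, which is itself exponential; the point of the sketch is only the final step, where a sparse coefficient table turns each explanation into a short sum.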
