Learning to Understand: Identifying Interactions via the Möbius Transform

Justin S. Kang; Yigit E. Erginbas; Landon Butler; Ramtin Pedarsani; Kannan Ramchandran

TL;DR

This work proposes the Sparse Möbius Transform (SMT), which identifies high-order interactions in a function by exploiting sparsity in its Möbius coefficients. By combining subsampling (aliasing) with non-adaptive group testing and a peeling message-passing algorithm, SMT achieves exact Möbius reconstruction with $O(Kn)$ samples and $O(Kn^2)$ time under uniform-interaction assumptions. In the low-degree setting, where interactions involve at most $t$ inputs, it needs $O(Kt\log n)$ samples, even in the presence of noise. Given the same number of terms, the approach yields more faithful explanations than Shapley or Banzhaf values on several real-model tasks (e.g., breast cancer diagnosis, sentiment analysis, and question answering), which matters for model interpretability and data valuation. The method draws on sparse signal processing, coding theory, and group testing to deliver a scalable, non-adaptive framework for uncovering meaningful input interactions in complex models.

Abstract

One of the key challenges in machine learning is to find interpretable representations of learned functions. The Möbius transform is essential for this purpose, as its coefficients correspond to unique importance scores for sets of input variables. This transform is closely related to widely used game-theoretic notions of importance like the Shapley and Banzhaf values, but it also captures crucial higher-order interactions. Although computing the Möbius transform of a function with $n$ inputs involves $2^n$ coefficients, it becomes tractable when the function is sparse and of low degree, which we show is the case for many real-world functions. Under these conditions, the complexity of the transform computation is significantly reduced. When there are $K$ non-zero coefficients, our algorithm recovers the Möbius transform in $O(Kn)$ samples and $O(Kn^2)$ time asymptotically under certain assumptions, making it the first non-adaptive algorithm to do so. We also uncover a surprising connection between group testing and the Möbius transform. For functions where all interactions involve at most $t$ inputs, we use group testing results to compute the Möbius transform with $O(Kt\log n)$ sample complexity and $O(K\mathrm{poly}(n))$ time. A robust version of this algorithm withstands noise and maintains this complexity. This marks the first noise-tolerant algorithm for the Möbius transform whose query complexity is sub-linear in $n$. In several examples, we observe that representations generated via the sparse Möbius transform are up to twice as faithful to the original function as Shapley and Banzhaf values, while using the same number of terms.
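To make the objects in the abstract concrete, here is a minimal brute-force sketch of the Möbius transform of a set function and of how Shapley values can be read off from its coefficients. This is not the SMT algorithm: it touches all $2^n$ subsets and is only meant for tiny $n$. The function names, the bitmask encoding of subsets, and the toy function `f` are assumptions made for illustration. The sign convention $F(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} f(T)$ and the identity "Shapley value of input $i$ equals the sum of $F(S)/|S|$ over all $S$ containing $i$" are the standard ones for set functions, and may differ in notation from the paper.

```python
def subsets(mask):
    """Yield every submask of `mask`, including `mask` itself and 0."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            break
        sub = (sub - 1) & mask


def popcount(mask):
    """Number of inputs in the subset encoded by `mask`."""
    return bin(mask).count("1")


def mobius_transform(f, n):
    """Brute force: F[S] = sum over T subset of S of (-1)^(|S|-|T|) * f[T]."""
    F = [0.0] * (1 << n)
    for S in range(1 << n):
        for T in subsets(S):
            sign = -1 if (popcount(S) - popcount(T)) % 2 else 1
            F[S] += sign * f[T]
    return F


def inverse_mobius(F, n):
    """Recover the function: f[S] = sum over T subset of S of F[T]."""
    return [sum(F[T] for T in subsets(S)) for S in range(1 << n)]


def shapley_from_mobius(F, n):
    """Each coefficient F[S] is split equally among the inputs in S."""
    values = [0.0] * n
    for S in range(1, 1 << n):
        for i in range(n):
            if (S >> i) & 1:
                values[i] += F[S] / popcount(S)
    return values


# Hypothetical toy function on n = 3 inputs: f(x) = 2*x0 + 3*x1*x2,
# where bit i of the mask S says whether input i is present.
n = 3
f = [2 * (S & 1) + 3 * ((S >> 1) & 1) * ((S >> 2) & 1) for S in range(1 << n)]

F = mobius_transform(f, n)
shapley = shapley_from_mobius(F, n)
print("non-zero Möbius coefficients:", {S: c for S, c in enumerate(F) if c != 0})
print("Shapley values:", shapley)
```

In this toy case only two coefficients are non-zero (the singleton on input 0 and the pair on inputs 1 and 2), which is the kind of sparse, low-degree structure the abstract relies on. The Shapley values split the pairwise term evenly between inputs 1 and 2, so the interaction itself is no longer visible in them.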

Paper Structure (56 sections, 21 theorems, 109 equations, 17 figures, 1 table, 1 algorithm)

Introduction
Defining the Möbius Transform
Related Works and Applications
Main Contributions
Notation
Understanding Assumptions: Sparsity and Low Degree
Algorithm Overview
Subsampling and Aliasing
Message Passing to Resolve Collisions
Singleton Detection and Identification
Singleton Detection and Identification
Singleton Identification in the Low-Degree Setting
Extension to Noisy Setting
Results
Synthetic simulations
...and 41 more sections

Key Result

Lemma 3.1

Choose $\mathbf m_{\boldsymbol{\ell}} = \overline{\mathbf H^{\textrm{T}}\overline{\boldsymbol{\ell}}}$, which results in $\mathcal{A}(\mathbf j) = \{ \mathbf k : \mathbf H\mathbf k = \mathbf j\}$. $\mathbf H$ is chosen as follows: If chosen this way, non-zero indices are mapped to the $2^b$ sampling sets $\mathcal{A}(\mathbf j)$ independently and uniformly at random asymptotically, thus maximizin
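The aliasing pattern stated in the lemma can be checked numerically on a small example. The sketch below assumes that the products $\mathbf H^{\textrm{T}}\overline{\boldsymbol{\ell}}$ and $\mathbf H\mathbf k$ use Boolean (OR-AND) arithmetic and that the overline denotes bitwise complement; that reading is an assumption here, although it is the one under which the identity works out in this example. The matrix `H`, the sparse coefficients `F`, and all helper names are hypothetical, and `H` is fixed by hand rather than drawn from the random construction the lemma refers to.

```python
from itertools import product


def leq(a, b):
    """Componentwise a <= b for binary tuples (a is a subset of b)."""
    return all(x <= y for x, y in zip(a, b))


def complement(v):
    return tuple(1 - x for x in v)


def bool_matvec(M, v):
    """Boolean (OR-AND) matrix-vector product."""
    return tuple(int(any(r & x for r, x in zip(row, v))) for row in M)


def transpose(M):
    return [list(col) for col in zip(*M)]


n, b = 4, 2
H = [[1, 1, 0, 0],   # hypothetical b x n binary subsampling matrix
     [0, 1, 1, 0]]

# Hypothetical sparse Möbius coefficients F(k), indexed by k in {0,1}^n.
F = {(1, 0, 0, 0): 2.0, (0, 1, 0, 0): -1.0, (0, 0, 1, 0): 0.5,
     (0, 0, 0, 1): 4.0, (1, 0, 1, 0): 3.0}


def f(m):
    """The function itself: f(m) = sum of F(k) over all k <= m."""
    return sum(val for k, val in F.items() if leq(k, m))


def sample_point(ell):
    """m_ell = complement(H^T complement(ell)), with Boolean arithmetic."""
    return complement(bool_matvec(transpose(H), complement(ell)))


cube = list(product((0, 1), repeat=b))

# Subsample f at only 2^b points, then take a small b-dimensional transform.
u = {ell: f(sample_point(ell)) for ell in cube}
U = {j: sum((-1) ** (sum(j) - sum(ell)) * u[ell] for ell in cube if leq(ell, j))
     for j in cube}

# Prediction from the lemma: U(j) collects every F(k) with H k = j.
aliased = {j: 0.0 for j in cube}
for k, val in F.items():
    aliased[bool_matvec(H, k)] += val

for j in cube:
    print(j, U[j], aliased[j])
```

With this `H`, three of the four bins receive exactly one coefficient and one bin receives two. In the terminology of the Figure 3 caption these are singletons and a multiton; separating them is the job of the peeling step.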

Figures (17)

Figure 1: The movie review "Her acting never fails to impress" is passed into a BERT language model fine-tuned to do sentiment analysis \cite{perez2021pysentimiento}. Presented are $1^{st}$, $2^{nd}$ and $3^{rd}$ order Möbius coefficients, computed via \ref{eq:inverse_transform}, with positive interactions in green and negative in red. The coefficients explain how groups of words influence BERT's perception of sentiment. For instance, while "never" and "fails" have strong negative sentiments individually, when combined, they impose a profound positive sentiment. In the second row, the word "never" is deleted, resulting in a large change in sentiment. In contrast, the Shapley values of each word $\mathop{\mathrm{SV}}\nolimits(\cdot)$, presented at the bottom of the figure, are less informative.
Figure 2: These plots are strong indicators that sparsity and low-degree assumptions are worthy of consideration. We consider three different learning tasks. The left-most plot shows results from an XGBoost \cite{Chen:2016:XST:2939672.2939785} model used for breast cancer diagnosis. The middle plot shows results from a word-level sentiment analysis task using a BERT model \cite{perez2021pysentimiento}, as in Fig. 1. The right-most plot shows results from a multiple-choice question answering task, also using a BERT model \cite{bertMC}. Error bars represent standard deviation over 10 different instances. Details for each setting are in Appendix \ref{apdx:fig_examples}. In all cases, the number of features is $n \approx 20$, for which it is possible to perform the full Möbius transform. On the top row, we plot achievable faithfulness $R^2$ as a function of sparsity. We observe that in all cases, faithfulness approaching $1$ requires only a few thousand Möbius coefficients, motivating our sparsity assumption. The bottom row of plots considers achievable faithfulness vs. degree, i.e., what $R^2$ can be achieved using only Möbius coefficients $\hat{F}$ up to a given degree. Here we observe that in nearly all cases, low-degree coefficients suffice to get quite high $R^2$, motivating our low-degree assumption.
Figure 3: This figure considers a "sparsified" version of the Möbius coefficients depicted in Fig. 1, keeping only the largest 4 depicted. Two different sampling choices are shown, as well as the resulting aliasing sets. In the first aliasing set, there is one zeroton, two singletons, and one multiton. In the second aliasing set, there are two zerotons, one singleton, and one multiton.
Figure 4: Depiction of our peeling message passing algorithm for the samples in Fig. \ref{fig:example_pt1}. The singleton in $U_2(01)$ is subtracted (peeled) so we can resolve $F(\mathbf k_2)$ from $U_1(11)$.
Figure 5: (a) Perfect reconstruction against $n$ and sample complexity under Assumption \ref{ass:unif}. Holding $C=3$, we scale $b$ to increase the sample complexity. We observe that the number of samples required to achieve perfect reconstruction scales linearly in $n$, as predicted. (b) Plot of the noise-robust version of our algorithm. For various values of $t$, we set $n=500$ and $K=500$, using a group testing matrix with $P=1000$. We plot the performance of our algorithm against SNR, measured in terms of $R^2$. Error bands represent the standard deviation over $10$ runs. (c) Runtime comparison of SMT, SHAP-IQ \cite{fumagalli2023shapiq}, and $t=5$ order FSI via LASSO \cite{tsai2023faith}. All are computing the Möbius transform in the setting where all non-zero interactions are order $t$, with $K=10$. SMT easily outperforms both, while the other methods become intractable. Error bands represent standard deviation over $10$ runs.
...and 12 more figures

Theorems & Definitions (30)

Lemma 3.1
Lemma 4.1
Theorem 5.1
Theorem 5.2
Lemma C.1
Definition C.2
Lemma C.3
Lemma C.4
proof
Theorem C.4
...and 20 more
