Hard to Explain: On the Computational Hardness of In-Distribution Model Interpretation

Guy Amir, Shahaf Bassan, Guy Katz

TL;DR

This work argues that interpretability should be analyzed through the lens of computational hardness, incorporating the data distribution via an OOD detector to ensure socially aligned explanations. It develops an abstract explainability query framework, $\mathbf{Q}$, that can express both misaligned and distribution-aligned explanations, and shows that, in general, obtaining socially aligned explanations is at least as hard as interpreting the distribution detector, with MLP-based models often governing the hardness. The authors establish general theorems linking $\mathbf{Q}(\mathcal{C}_{\mathcal{M}},\mathcal{C}_{\pi})$ to $\mathbf{Q}(\mathcal{C}_{\pi})$ under constructibility assumptions, and demonstrate model-specific results: FBDDs and MLPs are self-aligned (allowing a single model to encode alignment), whereas Perceptrons may not be self-aligned, implying that some model classes require external distribution-aware components to achieve aligned explanations. Together, these results illuminate fundamental limits and guide future work toward robust, distribution-aware interpretability, including potential approximations and broader model classes.

Abstract

The ability to interpret Machine Learning (ML) models is becoming increasingly essential. However, despite significant progress in the field, there remains a lack of rigorous characterization regarding the innate interpretability of different models. In an attempt to bridge this gap, recent work has demonstrated that it is possible to formally assess interpretability by studying the computational complexity of explaining the decisions of various models. In this setting, if explanations for a particular model can be obtained efficiently, the model is considered interpretable (since it can be explained "easily"). However, if generating explanations over an ML model is computationally intractable, it is considered uninterpretable. Prior research identified two key factors that influence the complexity of interpreting an ML model: (i) the type of the model (e.g., neural networks, decision trees); and (ii) the form of explanation (e.g., contrastive explanations, Shapley values). In this work, we claim that a third, important factor must also be considered for this analysis: the underlying distribution over which the explanation is obtained. Considering the underlying distribution is key in avoiding explanations that are socially misaligned, i.e., convey information that is biased and unhelpful to users. We demonstrate the significant influence of the underlying distribution on the resulting overall interpretation complexity, in two settings: (i) prediction models paired with an external out-of-distribution (OOD) detector; and (ii) prediction models designed to inherently generate socially aligned explanations. Our findings prove that the expressiveness of the distribution can significantly influence the overall complexity of interpretation, and identify essential prerequisites that a model must possess to generate socially aligned explanations.

Paper Structure

This paper contains 30 sections, 27 theorems, 56 equations, 5 figures, 1 table.

Key Result

Theorem 1

If $\mathbf{1}\in \mathcal{C}_{\pi}$ then $\mathbf{Q}(\mathcal{C}_{\mathcal{M}})\leq_{p} \mathbf{Q}(\mathcal{C}_{\mathcal{M}},\mathcal{C}_{\pi})$.
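
One way to read this reduction: when the constant-1 detector belongs to the detector class, an instance of the plain query can be handed to the aligned query by pairing the model with a detector that accepts every input. The Python sketch below illustrates the idea on a brute-force sufficient-reason check over Boolean inputs. The choice of query, the function names (`is_sufficient`, `is_sufficient_aligned`, `reduce_plain_to_aligned`), and the convention that the aligned query only quantifies over inputs accepted by the detector are illustrative assumptions, not the paper's definitions.

```python
from itertools import product

def completions(x, subset):
    """Yield every Boolean input that agrees with x on the indices in subset."""
    free = [i for i in range(len(x)) if i not in subset]
    for values in product([0, 1], repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        yield tuple(z)

def is_sufficient(f, x, subset):
    """Plain (hypothetical) query: fixing `subset` forces f's prediction everywhere."""
    return all(f(z) == f(x) for z in completions(x, subset))

def is_sufficient_aligned(f, pi, x, subset):
    """Aligned (hypothetical) query: same check, but only over inputs with pi(z) == 1."""
    return all(f(z) == f(x) for z in completions(x, subset) if pi(z) == 1)

def constant_one(z):
    """The detector that treats every input as in-distribution."""
    return 1

def reduce_plain_to_aligned(f, x, subset):
    """Map a plain-query instance to an aligned-query instance (assumed encoding)."""
    return (f, constant_one, x, subset)
```

With `constant_one` as the detector, the filter in `is_sufficient_aligned` never discards an input, so both checks return the same answer on every instance. Under this reading, any procedure for the aligned query also answers the plain query, which is the direction of the reduction stated in Theorem 1.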

Figures (5)

  • Figure 1: A visual illustration of Theorems \ref{generalized_theorem_1}, \ref{generalized_theorem_2}, and \ref{theorem_mlp_always_wins}. Dashed lines indicate that both queries are in the same complexity class and are hard for that class. Arrows are directed from the query with the "easier" complexity class to the query with the "harder" complexity class.
  • Figure 2: An illustration of the naive constructability of a Perceptron model, indicating the value $[1,1,0,1,0]$, with $w^+=1$ and $w^-=-1$. The bias term is $\left[-\sum_{1\leq i\leq n}(h^{1}_i\cdot x_i)\right] + 0.5 = (-3)+0.5 = -2.5$.
  • Figure 3: An illustration of the polynomial construction $f':=f_t \rightarrow f_s$, relying on $f_t, f_s\in \mathcal{C}_{\text{FBDD}}$. For $f_s$ and $f_t$, the dashed lines represent paths that end with a "0" leaf node, while solid lines represent paths that end with a "1" leaf node.
  • Figure 4: An illustration of the polynomial construction of $f_k\in \mathcal{C}_{\text{FBDD}}$, given $f'$. The blue boxes represent an area pruned during our recursive procedure in order to construct a valid FBDD without repetition of features.
  • Figure 5: An illustration of the effective construction $f_t \lor f_s\in \mathcal{C}_{\text{MLP}}$, relying on $f_t, f_s\in \mathcal{C}_{\text{MLP}}$.
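
The arithmetic in the Figure 2 caption can be checked with a small Python sketch. It builds a Perceptron that fires only on the target vector $[1,1,0,1,0]$, using weight $w^+=1$ for target bits equal to 1, weight $w^-=-1$ for target bits equal to 0, and a bias of $-3+0.5=-2.5$. The function names (`indicator_perceptron`, `predict`) and the step activation that fires when the pre-activation is at least 0 are assumptions made for illustration; the caption does not specify them.

```python
from itertools import product

def indicator_perceptron(target, w_pos=1.0, w_neg=-1.0):
    """Build weights and bias for a Perceptron that accepts only `target`."""
    weights = [w_pos if bit == 1 else w_neg for bit in target]
    # The weighted sum evaluated at the target itself, minus a 0.5 margin.
    bias = -sum(w * bit for w, bit in zip(weights, target)) + 0.5
    return weights, bias

def predict(weights, bias, x):
    """Step activation (assumed): output 1 when the pre-activation is >= 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias >= 0)

target = (1, 1, 0, 1, 0)
weights, bias = indicator_perceptron(target)

# Brute-force check over all 2^5 Boolean inputs.
accepted = [x for x in product([0, 1], repeat=len(target))
            if predict(weights, bias, x) == 1]
```

On the target the pre-activation is $3 - 2.5 = 0.5$, and flipping any single bit lowers the weighted sum by 1, so every other input lands at $-0.5$ or below. The 0.5 margin is what lets the construction work whether the threshold comparison is strict or not.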

Theorems & Definitions (35)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Theorem 3
  • Definition 4
  • Theorem 4
  • Proposition 2
  • ...and 25 more