Where We Have Arrived in Proving the Emergence of Sparse Symbolic Concepts in AI Models

Qihan Ren, Jiayang Gao, Wen Shen, Quanshi Zhang

TL;DR

The work asks whether DNNs can be faithfully explained by symbolic primitives. It defines interaction patterns via the Harsanyi dividend and proves that, under three common conditions (zero high-order derivatives of the output with respect to the input variables, confidence that increases monotonically as occlusion decreases, and confidence that does not degrade significantly under occlusion), a DNN's output decomposes into a small set of salient, sparse interactions. The theoretical results bound the number of nonzero k-order interactions and show that these salient interactions transfer across samples, with empirical validation on LLMs, vision models, and point-cloud models. This gives explainable AI a formal foundation based on interaction primitives, has practical implications for robustness and generalization, and is complemented by code for reproducibility.

Abstract

This study aims to prove the emergence of symbolic concepts (or more precisely, sparse primitive inference patterns) in well-trained deep neural networks (DNNs). Specifically, we prove that the following three conditions are sufficient for the emergence. (i) The high-order derivatives of the network output with respect to the input variables are all zero. (ii) The DNN can be used on occluded samples, and when the input sample is less occluded, the DNN yields higher confidence. (iii) The confidence of the DNN does not significantly degrade on occluded samples. These conditions are quite common, and we prove that under these conditions, the DNN will only encode a relatively small number of sparse interactions between input variables. Moreover, we can consider such interactions as symbolic primitive inference patterns encoded by a DNN, because we show that the inference scores of the DNN on an exponentially large number of randomly masked samples can always be well mimicked by the numerical effects of just a few interactions.

Paper Structure

This paper contains 35 sections, 10 theorems, 63 equations, 14 figures, and 2 tables.

Key Result

Theorem 1

Let the input sample $\bm{x}$ be arbitrarily masked to obtain a masked sample $\bm{x}_S$. The output of the DNN on the masked sample $\bm{x}_S$ can be disentangled into the sum of all interaction effects within $S$: $\forall S \subseteq N, \ v(\bm{x}_S)=\sum\nolimits_{T \subseteq S} I(T) + v(\bm{x}_{\emptyset})$.
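
To make this decomposition concrete, here is a minimal sketch on a hypothetical toy model; it is not the authors' code. It assumes the interaction effect is the Harsanyi dividend of the baseline-shifted output, $I(S)=\sum_{T \subseteq S}(-1)^{|S|-|T|}\,[v(\bm{x}_T)-v(\bm{x}_{\emptyset})]$, and checks that $v(\bm{x}_S)=\sum_{T \subseteq S} I(T)+v(\bm{x}_{\emptyset})$ holds on every masked sample. The names `subsets`, `toy_model`, and `harsanyi_interactions`, and all numeric effects, are illustrative assumptions.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset, including the empty set."""
    items = tuple(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def toy_model(unmasked):
    """Hypothetical v(x_S): the output when only variables in `unmasked` are kept.

    It is built from three AND patterns, so the true interactions are known.
    """
    out = 0.5                  # v(x_emptyset): output with every variable masked
    if {0, 1} <= unmasked:
        out += 2.0             # AND pattern over variables 0 and 1
    if {1, 2, 3} <= unmasked:
        out -= 1.0             # AND pattern over variables 1, 2 and 3
    if 3 in unmasked:
        out += 0.3             # single-variable pattern
    return out


def harsanyi_interactions(v, n):
    """Assumed definition: I(S) = sum_{T subset of S} (-1)^(|S|-|T|) (v(T) - v(empty))."""
    baseline = v(frozenset())
    effects = {}
    for S in subsets(range(n)):
        effects[S] = sum(
            (-1) ** (len(S) - len(T)) * (v(T) - baseline) for T in subsets(S)
        )
    return effects


N_VARS = 4
effects = harsanyi_interactions(toy_model, N_VARS)

# Largest reconstruction error over all 2^n masked samples.
max_error = max(
    abs(toy_model(S) - (sum(effects[T] for T in subsets(S)) + toy_model(frozenset())))
    for S in subsets(range(N_VARS))
)

# Interactions whose effect is not numerically zero.
salient = {tuple(sorted(S)): round(e, 6) for S, e in effects.items() if abs(e) > 1e-9}
print(max_error < 1e-9)
print(salient)
```

In this toy setting only the three planted AND patterns receive a nonzero effect, although the model is evaluated on all 16 masked samples. The exhaustive enumeration costs $O(3^n)$ model evaluations and is meant only for very small $n$.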

Figures (14)

  • Figure 1: Illustration of interactions encoded by a DNN. Each interaction $S$ corresponds to an AND relationship among a specific set $S$ of input variables (image patches). The patches $x_1$ and $x_4$ are masked, so that interactions $S_2$ and $S_3$ are deactivated.
  • Figure 2: Interactions extracted by PointNet++ on different samples in the ShapeNet dataset and their corresponding effects $I(S)$. Histograms on the right show the distribution of interaction effects $I(S)$ on different "motorbike" samples.
  • Figure 3: Normalized strength of different interactions, shown in descending order. Various DNNs trained for different tasks all encoded sparse interactions. In other words, only a relatively small number of interactions had a significant effect, while most interactions were noisy patterns and had near-zero effects, i.e., $I(S)\approx 0$.
  • Figure 4: Box-and-whisker diagram for the strength of interactions $|I(S)|$ of each order $m$. We tested different LLMs (OPT-1.3B, LLaMA-7B, and Aquila-7B) on the SQuAD dataset. Experiments show that high-order interactions on these networks were usually close to zero. Please see Appendix \ref{apdx:exp_setting_LLM} for experimental details and Appendix \ref{apdx:results-M-order} for results on more samples.
  • Figure 5: (a) Visualization of the monotonicity. Each curve shows the monotonic increase of the average output of the $m$-th order $\bar{u}^{(m)}$ with the order $m$ on a sample. The shaded area indicates the standard deviation of all $m$-order outputs on a sample, i.e., ${\rm Std}_{|S|=m}[u(S)]$. Note that the value of standard deviation does not affect our proof, because the proof only relies on the average output $\bar{u}^{(m)}$. (b) The average value of $p$ over different input samples, along with the standard deviation.
  • ...and 9 more figures
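
The sparsity pictured in Figures 3 and 4 can be summarized numerically once interaction effects are available. The sketch below works on made-up effects. It assumes that "normalized strength" means $|I(S)|/\max_{S'}|I(S')|$ and uses a hypothetical salience threshold `tau`. Neither the numbers nor the threshold come from the paper.

```python
from collections import defaultdict

# Hypothetical interaction effects I(S), keyed by the set S of input variables.
effects = {
    frozenset({0}): 0.02,
    frozenset({3}): 0.30,
    frozenset({0, 1}): 2.00,
    frozenset({2, 3}): -0.01,
    frozenset({1, 2, 3}): -1.00,
    frozenset({0, 1, 2, 3}): 0.005,
}


def normalized_strengths(effects):
    """Return (S, |I(S)| / max |I|) pairs sorted by strength, largest first."""
    peak = max(abs(e) for e in effects.values())
    ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(S, abs(e) / peak) for S, e in ranked]


def mean_strength_by_order(effects):
    """Average |I(S)| over all interactions of the same order m = |S|."""
    buckets = defaultdict(list)
    for S, e in effects.items():
        buckets[len(S)].append(abs(e))
    return {m: sum(vals) / len(vals) for m, vals in sorted(buckets.items())}


tau = 0.05  # assumed threshold on normalized strength, chosen for illustration
ranked = normalized_strengths(effects)
salient = [S for S, strength in ranked if strength > tau]
by_order = mean_strength_by_order(effects)

for S, strength in ranked:
    print(sorted(S), round(strength, 4))
print("salient interactions:", len(salient), "of", len(effects))
print("mean |I(S)| per order:", by_order)
```

Plotting the values in `ranked` against their rank gives a descending curve in the style of Figure 3, and `by_order` is a mean-only analogue of the per-order box plot in Figure 4.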

Theorems & Definitions (21)

  • Theorem 1: Proven in ren2021AOG and Appendix \ref{apdx:proof_of_theorem1}
  • Theorem 2: Proven in Appendix \ref{apdx:proof_of_theorem5}
  • Theorem 3: Proven in Appendix \ref{apdx:proof_of_theorem6}
  • Theorem 4: Connection to the Shapley value shapley1953value, proven in both harsanyi1963 and Appendix \ref{apdx:proof_of_theorem_connect_shapley}
  • Theorem 5: Connection to the Shapley interaction index grabisch1999axiomatic, proven in both ren2021AOG and Appendix \ref{apdx:proof_of_theorem_connect_shapley_inter_index}
  • Theorem 6: Connection to the Shapley Taylor interaction index sundararajan2020shapley, proven in both ren2021AOG and Appendix \ref{apdx:proof_of_theorem_connect_shapley_taylor_inter_index}
  • proof
  • Lemma 1
  • proof
  • proof
  • ...and 11 more
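
Theorem 4 connects the interactions to the Shapley value. A standard form of this connection, assumed here because the page does not state the theorem, is that each interaction effect is shared equally among its member variables: $\phi(i)=\sum_{S \ni i} I(S)/|S|$. The sketch below checks the identity numerically on a hypothetical toy game by comparing it with the classical marginal-contribution formula. All function names and numeric values are illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset, including the empty set."""
    items = tuple(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def toy_game(S):
    """Hypothetical v(x_S) over four input variables."""
    out = 0.5
    if {0, 1} <= S:
        out += 2.0
    if {1, 2, 3} <= S:
        out -= 1.0
    if 3 in S:
        out += 0.3
    return out


def harsanyi(v, n):
    """Harsanyi dividends of the baseline-shifted output v(x_S) - v(x_emptyset)."""
    base = v(frozenset())
    return {
        S: sum((-1) ** (len(S) - len(T)) * (v(T) - base) for T in subsets(S))
        for S in subsets(range(n))
    }


def shapley_direct(v, n, i):
    """Classical Shapley value: weighted marginal contributions of variable i."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for S in subsets(others):
        weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
        total += weight * (v(S | {i}) - v(S))
    return total


def shapley_from_interactions(effects, i):
    """Split each interaction effect I(S) equally among the variables in S."""
    return sum(e / len(S) for S, e in effects.items() if i in S)


n = 4
effects = harsanyi(toy_game, n)
direct = [shapley_direct(toy_game, n, i) for i in range(n)]
shared = [shapley_from_interactions(effects, i) for i in range(n)]
gap = max(abs(a - b) for a, b in zip(direct, shared))
print(gap < 1e-9)
```

Both routes give the same attributions up to floating-point error, and the values sum to $v(\bm{x}_N)-v(\bm{x}_{\emptyset})$, the usual efficiency property of the Shapley value.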