Thermodynamics-inspired Explanations of Artificial Intelligence

Shams Mehdi; Pratyush Tiwary

TL;DR

This work addresses the opacity of black-box AI by introducing TERP, a thermodynamics-inspired framework that yields interpretable explanations for individual predictions. It defines interpretation unfaithfulness $\mathcal{U}$ and interpretation entropy $\mathcal{S}$ and combines them into a free-energy objective $\zeta = \mathcal{U} + \theta \mathcal{S}$, whose minimum selects a unique, optimal explanation. Using model-agnostic linear surrogates and a neighborhood similarity measure based on linear discriminant analysis (LDA), TERP is demonstrated on molecular dynamics (VAMPnets), image classification (Vision Transformers), and text classification (Att-BLSTM), producing compact, faithful rationales that align with domain knowledge. The parameter $\theta$ gives a principled, tunable trade-off between faithfulness and interpretability, supporting trustworthy use of AI in critical scientific tasks and beyond.
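The role of $\theta$ in $\zeta = \mathcal{U} + \theta \mathcal{S}$ can be illustrated with a minimal sketch. The arrays `U` and `S` below are hypothetical values for surrogate models with $j = 1, \dots, 5$ features, not results from the paper, and the function names `free_energy` and `best_j` are introduced here only for illustration.

```python
import numpy as np

def free_energy(U, S, theta):
    """Return zeta^j = U^j + theta * S^j for every candidate model j."""
    U = np.asarray(U, dtype=float)
    S = np.asarray(S, dtype=float)
    return U + theta * S

def best_j(U, S, theta):
    """Return the 1-indexed number of features j that minimizes zeta^j."""
    return int(np.argmin(free_energy(U, S, theta))) + 1

# Hypothetical values (assumption): unfaithfulness falls and
# interpretation entropy rises as more features are included.
U = [0.60, 0.30, 0.18, 0.14, 0.12]
S = [0.00, 0.35, 0.55, 0.70, 0.80]

for theta in (0.1, 0.5, 2.0):
    print(f"theta={theta}: optimal j = {best_j(U, S, theta)}")
```

With these made-up numbers, a larger $\theta$ penalizes entropy more strongly and moves the minimum toward fewer features, which matches the qualitative behavior described for Figure 2.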

Abstract

In recent years, predictive machine learning methods have gained prominence in various scientific domains. However, due to their black-box nature, it is essential to establish trust in these models before accepting them as accurate. One promising strategy for assigning trust involves employing explanation techniques that elucidate the rationale behind a black-box model's predictions in a manner that humans can understand. Assessing the degree of human interpretability of the rationale generated by such methods is, however, a nontrivial challenge. In this work, we introduce interpretation entropy as a universal solution for assessing the degree of human interpretability associated with any linear model. Using this concept and drawing inspiration from classical thermodynamics, we present Thermodynamics-inspired Explainable Representations of AI and other black-box Paradigms (TERP), a method for generating accurate and human-interpretable explanations for black-box predictions in a model-agnostic manner. To demonstrate the wide-ranging applicability of TERP, we successfully employ it to explain various black-box model architectures, including deep learning Autoencoders, Recurrent Neural Networks, and Convolutional Neural Networks, across diverse domains such as molecular simulations, text, and image classification.
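The abstract does not state the formula for interpretation entropy. As a hedged illustration, the sketch below assumes it is the Shannon entropy of the normalized absolute coefficients of a linear model, divided by $\log n$ so that it lies in $[0, 1]$. The function name `interpretation_entropy` and the two coefficient vectors are hypothetical; they are chosen to mimic the contrast in Figure 1, where two six-parameter models differ in how clearly a few features stand out.

```python
import numpy as np

def interpretation_entropy(coefs):
    """Normalized Shannon entropy of |coefficients| (assumed definition)."""
    w = np.abs(np.asarray(coefs, dtype=float))
    if w.size < 2 or w.sum() == 0.0:
        raise ValueError("need at least two coefficients, not all zero")
    p = w / w.sum()            # treat relative weights as a distribution
    nz = p[p > 0]              # 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum() / np.log(w.size))

# Hypothetical six-feature models (assumption, not data from the paper).
model_a = [0.18, 0.16, 0.17, 0.16, 0.17, 0.16]  # weights spread evenly
model_b = [0.45, 0.45, 0.03, 0.03, 0.02, 0.02]  # two features dominate

print(interpretation_entropy(model_a))  # close to 1: hard to interpret
print(interpretation_entropy(model_b))  # noticeably lower: easier to interpret
```

Under this assumed definition, both models have the same number of parameters but very different entropies, which is why parameter count alone is a poor proxy for interpretability.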

Paper Structure (19 sections, 9 equations, 5 figures, 1 algorithm)

Introduction
Results
Interpretation Unfaithfulness ($\mathcal{U}$) for surrogate model construction
Interpretation Entropy ($\mathcal{S}$) for model selection
Free Energy ($\zeta$) for optimal explanation
Application to AI-augmented MD: VAMPnets
Application to Image classification: Vision Transformers (ViTs)
Application to Text classification: Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM)
Discussion
Methods
Neighborhood generation
AI-augmented MD method: VAMPnets
Image classification: Vision Transformers (ViTs)
Text classification: Attention-based Bidirectional Long Short-Term Memory (Att-BLSTM)
Competing Interests
...and 4 more sections

Figures (5)

Figure 1: Model complexity is not a good descriptor for human interpretability. Illustrative linear models (a) and (b) predict a target with the same level of accuracy. Both have the same number of model parameters (six); however, model (b) is significantly more human-interpretable than model (a). In model (b), two of the six features stand out as most relevant for making predictions, while it is difficult to identify the relevant features for model (a).
Figure 2: Illustrative example highlighting properties of free energy $\zeta^j$, unfaithfulness $\mathcal{U}^j$, and interpretation entropy $\mathcal{S}^j$. (a) Strength of $\mathcal{S}^j$ contribution to $\zeta^j$ can be tuned using $\theta$. $\zeta^j$ vs. $j$ plots for three different $\theta=37.2,16.9,56.0$ are shown resulting in a minimum at $j=12,14,8$ respectively. (b) $\mathcal{U}^j$ vs. $j$ remains unaffected by $\theta$. (c) $\mathcal{S}^j$ vs. $j$ plot shows that the strength of the trade-off can be tuned by $\theta$.
Figure 3: Using TERP to explain VAMPnets for molecular dynamics simulations of alanine dipeptide in vacuum. (a) Representative conformational states of alanine dipeptide labelled I, II, III. Projected converged states are highlighted in three different colors as obtained by VAMPnets along (b) ($\phi,\psi$) dihedral angles. $713$ different configurations are chosen for TERP and the first and second dominant features are highlighted using colored stars ($\star$). (d) Relative feature importance of a specific example point A. (e) High-dimensional neighborhood data projected onto 1-d using LDA for an improved similarity measure. Binarizing the class prediction probabilities of the neighborhood using a threshold of $0.5$ yields the 'explanation' and 'not explanation' classes, respectively. The LDA projection separates the two regimes of prediction probability, showing that the projection is meaningful. Average similarity error, $\Delta \Pi$ (Eq. \ref{eq:sim}), per datapoint for (f) Euclidean and (g) LDA-based similarity, respectively. Comparison between (f) and (g) shows minimal error for LDA-based similarity, specifically demonstrated for an input space constructed from the four dihedral angles plus one pure noise, four pure noise, and four correlated features with partial noise, respectively. The input space for no actual data and four pure noise features in (f) establishes a baseline, showing that the Euclidean similarity will include significant error even when one redundant feature is included. All the calculations were performed in 100 independent trials to appropriately examine the effects.
Figure 4: Using TERP to explain and check the reliability of a ViT trained on the CelebA dataset. (a) ViT predicts the presence of 'Eyeglasses' in this image with a probability of $0.998$. (b) Superpixel definitions for the test image following the 16x16 pixel definition of ViT patches. TERP results showcasing (c) $\mathcal{U}^j$, (d) $\mathcal{S}^j$, (e) $\zeta^j$, and (f) $\theta^j$ as a function of $j$. (g) Corresponding TERP explanation. The maximal drop in $\theta^j$ happens when going from $j=2$ to $j=3$. By defining the optimal temperature $\theta^o=\frac{\theta^{j=2}+\theta^{j=3}}{2}$ as discussed in Sec. \ref{sec:optimality}, a minimum in $\zeta^j$ is observed at $j=3$. Panels (h-j) show sanity checks \cite{adebayo2018sanity}, i.e., the result of an AI explanation scheme should be sensitive under model parameter randomization (h,i) and data randomization (j). (k) Saliency map results as a baseline explanation for the 'Eyeglasses' prediction. Red color highlights pixels with high absolute values of the class probability gradient across RGB channels. High gradient at pixels not relevant to 'Eyeglasses' shows the limitation of the saliency map explanation. (i,j) TERP and saliency map explanations for the class 'Male'. Plots of $\mathcal{U}^j$, $\mathcal{S}^j$, $\zeta^j$, and $\theta^j$ as a function of $j$ are provided in the SI.
Figure 5: Using TERP to explain and check the reliability of an Att-BLSTM model trained on AG's news corpus to predict the news story titled "AI predicts protein structures". (a) Relative feature importance of the two most influential keywords, 'science' and 'species', as identified by TERP. (b) $\mathcal{U}^j$, (c) $\mathcal{S}^j$, (d) $\theta^j$, and (e) $\zeta^j$ vs. $j$ plots showing that the optimal explanation appears at $j=2$, due to the maximum drop in $\theta^j$ as $j$ is increased from $1$ to $2$.
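
The captions of Figures 4 and 5 describe choosing the optimal temperature $\theta^o$ at the largest drop in $\theta^j$ and then reading off the minimum of $\zeta^j$. The captions do not define $\theta^j$, so the sketch below assumes it is the temperature at which $\zeta^j = \zeta^{j+1}$, that is, $\theta^j = -(\mathcal{U}^{j+1}-\mathcal{U}^j)/(\mathcal{S}^{j+1}-\mathcal{S}^j)$. That definition, the function names, and the numerical values are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def transition_temperatures(U, S):
    """theta^j for j = 1..n-1, assuming zeta^j == zeta^(j+1) at theta^j."""
    U = np.asarray(U, dtype=float)
    S = np.asarray(S, dtype=float)
    return -np.diff(U) / np.diff(S)

def optimal_explanation(U, S):
    """Return (theta_o, j_opt) using the largest drop in theta^j."""
    theta = transition_temperatures(U, S)
    drops = theta[:-1] - theta[1:]       # drop from theta^j to theta^(j+1)
    k = int(np.argmax(drops))            # 0-indexed position of largest drop
    theta_o = 0.5 * (theta[k] + theta[k + 1])
    zeta = np.asarray(U, dtype=float) + theta_o * np.asarray(S, dtype=float)
    return theta_o, int(np.argmin(zeta)) + 1   # j is 1-indexed

# Hypothetical values (assumption): U decreases and S increases with j.
U = [0.60, 0.30, 0.18, 0.14, 0.12]
S = [0.00, 0.35, 0.55, 0.70, 0.80]

theta_o, j_opt = optimal_explanation(U, S)
print(f"theta_o = {theta_o:.4f}, optimal j = {j_opt}")
```

For these made-up values the largest drop occurs between $\theta^{j=2}$ and $\theta^{j=3}$, so $\theta^o$ is their average and the free energy is minimized at $j=3$, mirroring the pattern reported in the Figure 4 caption.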
