
Concept frustration: Aligning human concepts and machine representations

Enrico Parisini, Christopher J. Soelistyo, Ahab Isaac, Alessandro Barp, Christopher R. S. Banerji

Abstract

Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.

Paper Structure

This paper contains 38 sections, 4 theorems, 97 equations, 7 figures, 4 tables.

Key Result

Theorem 1

Consider the black-box model defined above with sigmoid output. The Fisher information metric on activation space satisfies $F_A(\mathbf{a}) = p(\mathbf{a})\,\bigl(1 - p(\mathbf{a})\bigr)\, g(\mathbf{a})\, g(\mathbf{a})^{\top}$, where $p(\mathbf{a})\equiv p(y=1|\mathbf{a}) = \sigma(l(\mathbf{a}))$ and $g(\mathbf{a}) = \nabla_\mathbf{a} l(\mathbf{a})$.
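This metric can be evaluated directly. Below is a minimal numpy sketch: only $p(\mathbf{a}) = \sigma(l(\mathbf{a}))$ and $g(\mathbf{a}) = \nabla_\mathbf{a} l(\mathbf{a})$ come from the theorem statement; the toy one-hidden-layer logit and its weights are hypothetical stand-ins for the black-box model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-hidden-layer logit l(a) = w2 . tanh(W1 a) + b (illustrative weights)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
w2 = rng.normal(size=5)
b = 0.1

def logit(a):
    return w2 @ np.tanh(W1 @ a) + b

def logit_grad(a):
    # g(a) = dl/da = W1^T [(1 - tanh^2(W1 a)) * w2] by the chain rule
    h = np.tanh(W1 @ a)
    return W1.T @ ((1.0 - h**2) * w2)

def fisher_metric(a):
    # F_A(a) = p(a) (1 - p(a)) g(a) g(a)^T: a rank-one PSD matrix
    p = sigmoid(logit(a))
    g = logit_grad(a)
    return p * (1.0 - p) * np.outer(g, g)

a = np.array([0.5, -1.0, 2.0])
F = fisher_metric(a)
print(F.shape, np.linalg.matrix_rank(F))  # (3, 3) 1
```

Since $F_A(\mathbf{a})$ is rank one, it measures displacements only along the task-relevant direction $g(\mathbf{a})$, which is one way to read the contrast the abstract draws between task-aligned geometry and conventional Euclidean comparisons.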

Figures (7)

  • Figure 1: The importance of concept frustration. Consider a flatworlder training a model to predict a position on Earth from a complex signal (such as from a satellite), in an interpretable manner. They suggest two known concepts, $C_1$ = distance from the North Pole and $C_2$ = distance from the South Pole, and expect these to be anti-correlated. Their trained model predicts position highly accurately, but reasons using an unknown concept $C_3$, which frustrates their known inter-concept semantics: increasing $C_3$ forces $C_1$ and $C_2$ to be positively correlated. Once $C_3$ is understood as distance to the core of a curved Earth, the frustration is resolved and understanding of the system is improved.
  • Figure 2: Overview of approach. (A) Raw data is typically processed by a feature extractor (such as a foundation model) to produce an activation vector. (B) Activations can be used to predict an outcome by reasoning over a set of intermediate concepts, some of which may be known prior to predicting the task. This task can be approached using (C) a black-box model, (D) a sparse autoencoder, or (E) a concept bottleneck model.
  • Figure 3: The treasure hunter task and frustration. A. We receive a complex signal encoding the position of a treasure, and from this signal we wish to solve the binary task of identifying whether the treasure lies within a fixed search radius of our position. We employ an interpretable model whose known concepts $C_1$ and $C_2$ are the distances to the North and South polar lines parallel to the Earth's surface, and we consider the unknown concept $C_3$ to denote depth below the surface. We consider two geometries: a flat Earth, in which there is no frustration between known and unknown concepts, and a round Earth, in which there is frustration. B. Box and violin plots display $\gamma_{F_A}$ and $\gamma_E$ calculated from 50 simulations of the flat- and round-Earth treasure hunter tasks, alongside paired Wilcoxon $p$-values and 95% confidence intervals of the median difference between round- and flat-Earth metric values (see the statistical sketch after this list). We see that $\gamma_{F_A}$ detects frustration while $\gamma_{E}$ does not.
  • Figure 4: CBM accuracy and interpretability are degraded under frustration. Box and violin plots display (A) accuracy of black-box classifiers, (B) accuracy of CBM classifiers, (C) concept mean-square error (MSE) for CBM classifiers, and (D) the concept semantic fidelity metric $\beta$, for $\alpha \in \{-1,0,1\}$, calculated from 600 simulations of our synthetic data generator across a range of parameter regimes, alongside paired Wilcoxon $p$-values and 95% confidence intervals of the median difference between $\alpha$ values. We see that frustration significantly degrades the accuracy and interpretability of CBMs without making the task more difficult for a non-interpretable model.
  • Figure 5: $\gamma_{F_A}$ detects frustration in inter-concept covariance and concept-task weights between known and unknown concepts. Box and violin plots display $\gamma_{F_A}$ and $\gamma_E$ calculated from 600 simulations of our synthetic data model across 60 parameter regimes, separated by (A) $\alpha$ and (B) the sign of $T_2$, the theoretical cross-alignment term between known and unknown signal. Paired Wilcoxon $p$-values and 95% confidence intervals of the median difference between metric values across various parameter groups are displayed. We see that $\gamma_{F_A}$ is elevated when concept covariance between known and unknown concepts is frustrated, while $\gamma_{E}$ is not. Similarly, $\gamma_{F_A}$ is consistently elevated when the cross-alignment term $T_2$ is non-zero, while $\gamma_E$ is not.
  • ...and 2 more figures
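Figures 3-5 each report paired Wilcoxon $p$-values and 95% confidence intervals for the median paired difference across matched simulations. A minimal sketch of that comparison follows; the synthetic $\gamma$ values and the bootstrap percentile construction of the CI are assumptions for illustration, since the paper's simulation outputs and exact CI method are not reproduced on this page.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Stand-in metric values for 50 paired simulations (Figure 3 setting); real
# values would come from the flat- and round-Earth treasure hunter runs.
gamma_flat = rng.normal(0.1, 0.05, size=50)
gamma_round = gamma_flat + rng.normal(0.3, 0.05, size=50)

# Paired Wilcoxon signed-rank test on matched simulations
stat, p_value = wilcoxon(gamma_round, gamma_flat)

# 95% bootstrap percentile CI for the median paired difference (assumed method)
diffs = gamma_round - gamma_flat
boot_medians = np.array([
    np.median(rng.choice(diffs, size=diffs.size, replace=True))
    for _ in range(10_000)
])
ci_lo, ci_hi = np.percentile(boot_medians, [2.5, 97.5])

print(f"Wilcoxon p = {p_value:.2e}, median diff 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```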

Theorems & Definitions (11)

  • Theorem 1: Closed-form Fisher metric for one-hidden-layer binary model
  • Definition 1: Concept triplet frustration (see the illustrative sketch after this list)
  • Definition 2: Maximally frustrating direction
  • Definition 3: Pairwise frustration
  • Definition 4: Global frustration
  • Theorem 2: Concept-optimal Accuracy in the Linear–Gaussian Model
  • Definition 5: Semantic fidelity
  • ...and 1 more
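The body of Definition 1 is not reproduced on this page. As a loose illustration consistent with Figure 1's intuition (an unobserved concept forcing two concepts that should be anti-correlated into positive correlation), the sketch below uses the classical spin-glass sign rule, under which a triplet is frustrated when the product of the signs of its three pairwise correlations is negative. This rule is a stand-in assumption, not necessarily the paper's formal definition.

```python
import numpy as np

def triplet_frustrated(C, i, j, k, eps=0.05):
    """Spin-glass sign rule as a stand-in for concept triplet frustration
    (assumed; not necessarily the paper's Definition 1): triplet (i, j, k)
    is frustrated when the signs of its three pairwise correlations cannot
    be made mutually consistent, i.e. their product is negative."""
    R = np.corrcoef(C, rowvar=False)
    r = np.array([R[i, j], R[j, k], R[i, k]])
    if np.any(np.abs(r) < eps):
        return False  # near-zero correlation: no reliable sign
    return np.prod(np.sign(r)) < 0

# A frustrated triplet: C1 and C2 anti-correlated, yet both positively
# correlated with C3 (cf. the flatworlder's C3 in Figure 1).
rng = np.random.default_rng(2)
Sigma = np.array([[ 1.0, -0.5, 0.4],
                  [-0.5,  1.0, 0.4],
                  [ 0.4,  0.4, 1.0]])  # hypothetical inter-concept correlations
C = rng.multivariate_normal(np.zeros(3), Sigma, size=2000)
print(triplet_frustrated(C, 0, 1, 2))  # True: sign pattern (-, +, +)
```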