Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships

Angie Boggust, Hyemin Bang, Hendrik Strobelt, Arvind Satyanarayan

TL;DR

Abstraction alignment is a graph-based framework for assessing how closely a model's learned abstractions align with formal human knowledge. It maps model outputs to a human abstraction graph and propagates probabilities up through hierarchical concepts, yielding a fitted abstraction graph and three metrics (Abstraction Match, Subgraph Preference, and Concept Co-confusion) that can be aggregated across datasets. The approach is implemented in an interactive interface and evaluated through case studies in computer vision, NLP specificity benchmarks, and participatory medical dataset auditing, which reveal misalignments that traditional probes miss and guide refinements to both data and abstractions. The work broadens interpretability and model-audit capabilities, letting domain experts test alignment hypotheses at scale and across modalities, with open-source tooling to support real-world deployment.

Abstract

While interpretability methods identify a model's learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model's behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.

Paper Structure

This paper contains 43 sections, 3 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: To compare model behavior with human abstractions, abstraction alignment computes a fitted abstraction graph for each model decision. First, we map the model's output space to concepts in the human abstraction graph. Then, we assign each concept a value corresponding to the model's confidence in that concept or any of its descendants. The resulting fitted abstraction graph represents the model's confidence in a range of concepts across levels of abstraction.
  • Figure 2: The abstraction alignment interface visualizes a model's alignment with human abstractions. It displays the cumulative fitted abstraction graph (A), aggregated abstraction match (B), concept distribution (C), and concept co-confusion (E). Interacting with these panels or the query bar (D) updates the instance list (F) to show the fitted abstraction graphs of relevant inputs.
  • Figure 3: Interacting with the abstraction alignment interface allows users to explore alignment hypotheses. Users can select a concept (A–C), a concept pair (E), or define an alignment query (D) to update the interface with relevant dataset instances.
  • Figure 4: Abstraction alignment offers insights into the behavior of an image classification model. Querying alignment patterns distinguishes benign lack-of-specificity errors from more problematic misalignments (A), and analyzing abstraction match identifies which human concepts the model aligns with, highlighting potential failure cases (B).
  • Figure 5: Abstraction alignment helps ML researchers better understand the specificity of generative language models, revealing the model's tendency to generate specific outputs at the expense of correctness (A), confuse seemingly unrelated concepts (B), and overrely on one particular concept (C).
  • ...and 1 more figures
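
The fitting step described in the Figure 1 caption (map the model's outputs to concepts, then give each concept the model's confidence in that concept or any of its descendants) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the toy graph, the concept names, and the function name `fit_abstraction_graph` are all hypothetical, and the sketch assumes the abstraction graph is a tree in which every concept has at most one parent and the model's output classes are its leaves.

```python
from collections import defaultdict

# Hypothetical toy abstraction graph, stored as child -> parent.
# Assumption: the graph is a tree (at most one parent per concept).
PARENT = {
    "maple": "tree",
    "oak": "tree",
    "rose": "flower",
    "tree": "plant",
    "flower": "plant",
    "plant": None,  # root concept
}


def fit_abstraction_graph(leaf_probs, parent=PARENT):
    """Return {concept: value} for one model decision.

    Each concept's value is the model's total confidence in that
    concept or any of its descendants, obtained by adding every
    leaf probability to the leaf itself and to all of its ancestors.
    """
    fitted = defaultdict(float)
    for leaf, prob in leaf_probs.items():
        node = leaf
        while node is not None:
            fitted[node] += prob
            node = parent[node]
    return dict(fitted)


# Hypothetical softmax output over three leaf classes.
fitted = fit_abstraction_graph({"maple": 0.5, "oak": 0.25, "rose": 0.25})
for concept in ("plant", "tree", "flower", "maple", "oak", "rose"):
    print(f"{concept}: {fitted[concept]}")
```

In this toy decision the model is split between "maple" and "oak", yet the fitted graph places 0.75 of its confidence on their shared parent "tree", which is the kind of cross-level view the caption describes. If the abstraction graph is a DAG with multiple parents, the upward walk would need to track visited ancestors so that a leaf's probability is not counted twice at the same concept.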