Interpretable-by-Design Text Understanding with Iteratively Generated Concept Bottleneck

Josh Magnus Ludan; Qing Lyu; Yue Yang; Liam Dugan; Mark Yatskar; Chris Callison-Burch

TL;DR

TBM addresses interpretability in text classification with a three-module, end-to-end framework that uses LLMs to automatically discover, measure, and aggregate a sparse set of human-readable concepts. The model provides global explanations through learned concept weights and local explanations through per-example concept scores and cited snippets. Across 12 diverse datasets, TBM performs competitively with strong baselines end to end, particularly on sentiment tasks, while human studies validate concept quality and measurement reliability. The work presents TBM as a promising direction for interpretable NLP with minimal performance tradeoffs, while noting limitations and opportunities for scalability and refinement.

Abstract

Black-box deep neural networks excel in text classification, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBM), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBM predicts categorical values for a sparse set of salient concepts and uses a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a Large Language Model (LLM) without the need for human curation. Experiments on 12 diverse text understanding datasets demonstrate that TBM can rival the performance of black-box baselines such as few-shot GPT-4 and finetuned DeBERTa while falling short against finetuned GPT-3.5. Comprehensive human evaluation validates that TBM can generate high-quality concepts relevant to the task, and the concept measurement aligns well with human judgments, suggesting that the predictions made by TBMs are interpretable. Overall, our findings suggest that TBM is a promising new framework that enhances interpretability with minimal performance tradeoffs.
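
The final step described in the abstract, a linear layer over concept values, can be illustrated with a minimal sketch. This is not the authors' implementation: the concept names other than "Menu Variety", the score range of -1/0/+1, the label set, and the hand-set weights are all assumptions chosen for illustration. In TBM the weights would be learned from data and the scores would come from the LLM-based measurement module.

```python
from typing import Dict, List, Tuple

# Hypothetical concepts and labels (only "Menu Variety" appears in the paper's example).
CONCEPTS: List[str] = ["Food Quality", "Service", "Menu Variety"]
LABELS: List[str] = ["negative", "positive"]

# Hand-set weights for illustration; a real prediction layer would learn these.
# WEIGHTS[label][i] is the weight of CONCEPTS[i] for that label.
WEIGHTS: Dict[str, List[float]] = {
    "negative": [-1.5, -1.0, -0.5],
    "positive": [1.5, 1.0, 0.5],
}
BIAS: Dict[str, float] = {"negative": 0.0, "positive": 0.0}


def predict(scores: List[int]) -> Tuple[str, Dict[str, float]]:
    """Return the highest-scoring label and the per-label logits.

    `scores` holds one concept score per entry of CONCEPTS
    (assumed here to be -1, 0, or +1).
    """
    if len(scores) != len(CONCEPTS):
        raise ValueError("expected one score per concept")
    logits = {
        label: BIAS[label] + sum(w * s for w, s in zip(WEIGHTS[label], scores))
        for label in LABELS
    }
    best = max(LABELS, key=lambda label: logits[label])
    return best, logits


def local_explanation(scores: List[int], label: str) -> Dict[str, float]:
    """Per-concept contribution (weight * score) to one label's logit."""
    return {c: w * s for c, w, s in zip(CONCEPTS, WEIGHTS[label], scores)}


if __name__ == "__main__":
    example_scores = [1, -1, 1]  # good food, poor service, varied menu
    label, logits = predict(example_scores)
    print(label, logits)
    print(local_explanation(example_scores, label))
```

In this sketch the weight table plays the role of the global explanation, and the per-concept contributions returned by `local_explanation` play the role of the local explanation for a single input.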

Paper Structure (32 sections, 11 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Method
Method Formulation
Concept Representations
Concept Generation
Concept Measurement
Prediction Layer
Experimental Setup
Results
End-to-End Performance
Human Evaluation on Concept Generation Module
Human Evaluation on Concept Measurement Module
Analysis of Learning Curves
Conclusion and Limitations
...and 17 more sections

Figures (11)

Figure 1: Unlike end-to-end black-box language models (left), Text Bottleneck Models (right) first discover and measure a set of human-readable concepts and then predict the final label with an interpretable linear layer.
Figure 2: Demonstration of the Text Bottleneck Model (TBM) with an example from the CEBaB dataset (Abraham et al., 2022). Given an input example (a restaurant review), (a) Concept Generation (see the Concept Generation section) iteratively discovers new concepts (e.g., "Menu Variety"). (b) Concept Measurement (see the Concept Measurement section) measures the value of each concept by identifying relevant snippets (e.g., "food for everyone") and providing a numerical concept score (e.g., $+1$). Finally, the (c) Prediction Layer (see the Prediction Layer section) aggregates all concept scores for the input and learns their relative weights to make the final prediction of the task label.
Figure 3: Expert annotations of concept generation quality on five aspects. Redundancy (Rdy) is concept duplication; "bad" indicates repetition. Relevance (Rlv) is pertinence to the task; "bad" identifies spurious concepts. Leakage (Lkg) checks whether the concept directly performs the task; "bad" indicates leakage. Objectivity (Obj) is how clearly the concept can be measured; "bad" indicates subjectivity. Difficulty (Dfc) checks the complexity of measuring the concept; "bad" means that measuring the concept is harder than the dataset task itself.
Figure 4: Human evaluation on concept measurement. Machine-human correlation measures the Pearson correlation between the concept scores measured by the LLM vs. human annotators. Exact Match refers to the performance of the LLM in predicting the exact string label for a concept when using human annotation as ground truth.
Figure 5: Concept learning curves of TBM on 3 datasets. The x-axis represents the TBM's performance (MSE for regression tasks and accuracy for classification tasks) at each iteration, and the y-axis indicates the specific concept added to the bottleneck during that iteration. The size of each node is determined by the magnitude of the weight of the corresponding concept in the prediction layer.
...and 6 more figures
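
The two agreement measures named in the Figure 4 caption, Pearson correlation between LLM and human concept scores and exact match on string labels, can be sketched as follows. This is a plain-Python illustration under assumed inputs: the function names and the toy data are hypothetical, and the paper's own evaluation code may differ.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def exact_match(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted string labels equal to the human (gold) label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    llm_scores = [1, 0, -1, 1]
    human_scores = [1, 0, -1, 0]
    print(pearson(llm_scores, human_scores))
    print(exact_match(["high", "low", "low"], ["high", "low", "high"]))
```

Pearson correlation rewards scores that move together even when they differ by a constant offset, whereas exact match only counts identical labels, so the two measures can disagree on the same annotations.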
