Crafting Large Language Models for Enhanced Interpretability

Chung-En Sun; Tuomas Oikarinen; Tsui-Wei Weng

TL;DR

This work introduces CB-LLM, an intrinsically interpretable large language model built around a concept bottleneck, so that its predictions can be explained in terms of human-understandable concepts. Concepts are generated automatically by prompting and scored automatically with contrastively trained sentence embedding models; training then proceeds in two stages, first the Concept Bottleneck Layer and then a sparse predictor on top of it. Automatic Concept Correction (ACC) further aligns concept scores with the true classes, yielding accuracy on par with or better than finetuned black-box models on SST2, YelpP, AGNews, and DBpedia, while human evaluations indicate that the resulting explanations are more faithful. The approach supports practical interpretability through visualization, concept unlearning, and case studies, pointing to potential gains in transparency and fairness for NLP systems.
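The scoring and correction steps summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that concept scores are cosine similarities between a text embedding and concept embeddings, and that ACC zeroes out any score that is negative or belongs to a concept tied to a class other than the sample's label. The function names and the toy two-dimensional vectors are hypothetical; a real pipeline would obtain embeddings from a sentence embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_scores(text_emb, concept_embs):
    """Score one text against every concept (assumed: cosine similarity)."""
    return [cosine(text_emb, c) for c in concept_embs]


def correct_scores(scores, concept_classes, label):
    """Assumed ACC rule: keep a score only if it is positive and the
    concept's class matches the sample's label; otherwise set it to 0."""
    return [
        s if (s > 0 and cls == label) else 0.0
        for s, cls in zip(scores, concept_classes)
    ]


# Toy stand-ins for sentence embeddings (hypothetical values).
text_emb = [3.0, 4.0]
concept_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
concept_classes = [0, 1, 0]  # class each concept was generated for
label = 0                    # ground-truth class of the text

raw = concept_scores(text_emb, concept_embs)
corrected = correct_scores(raw, concept_classes, label)
print(raw)
print(corrected)
```

In this toy case the second concept has the highest raw similarity but belongs to the other class, so the correction removes it; the third concept is removed because its score is negative. The corrected scores would then serve as training targets for the Concept Bottleneck Layer.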

Abstract

We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our unique Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability -- a feature markedly absent in existing LLMs.

Paper Structure (30 sections, 5 equations, 10 figures, 5 tables)

Introduction
Background and related works
Post-hoc neuron analysis for NLP.
CBM in image classification.
Sentence embedding models with contrastive learning.
CB-LLMs: Building Interpretable Large Language Models
Concept generation
Automatic Concept Scoring (ACS)
Learning CB-LLM
Training the concept bottleneck layer (CBL):
Learning the predictor:
Automatic Concept Correction
Experiment results
Setup.
Accuracy of CB-LLM
...and 15 more sections

Figures (10)

Figure 1: The overview of our CB-LLM.
Figure 2: The process of Automatic Concept Scoring (ACS) through sentence embedding models.
Figure 3: The human evaluation results for task 2 --- Contribution Faithfulness. Workers prefer the explanations generated by CB-LLM w/ ACC more than the random explanations.
Figure 4: Ablation study on Automatic Concept Correction (ACC). Workers favor the explanations provided by the CB-LLMs with ACC.
Figure 5: Ablation study on the sparsity. Workers demonstrate only a marginal preference for explanations provided by the CB-LLMs with a sparse final layer.
...and 5 more figures
