Crafting Large Language Models for Enhanced Interpretability
Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
TL;DR
This work introduces CB-LLM, an intrinsically interpretable large language model built around a concept bottleneck, so that its predictions can be explained in terms of human-understandable concepts. Concepts are generated automatically by prompting and scored with contrastive sentence embeddings; a Concept Bottleneck Layer and a sparse predictor are then trained in two stages. A novel Automatic Concept Correction (ACC) step further aligns concept scores with the true classes, yielding accuracy on par with or better than finetuned black-box models on SST2, YelpP, AGNews, and DBpedia, while human evaluations support the faithfulness of the resulting explanations. The approach supports practical interpretability through visualization, concept unlearning, and case studies, highlighting potential improvements in transparency and fairness for NLP systems.
Abstract
We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs, which rely on post-hoc interpretation methods that offer limited insight into neuron function, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our unique Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability, a feature markedly absent in existing LLMs.
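To make the scoring and correction steps named in the TL;DR more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a concept score is the cosine similarity between a sentence embedding and a concept embedding, and that the correction step zeroes out any score that is negative or that belongs to a concept tied to a class other than the sample's label. The toy embeddings, the `concept_classes` mapping, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_scores(text_emb, concept_embs):
    """Score one text against every concept (assumed: similarity in embedding space)."""
    return [cosine(text_emb, c) for c in concept_embs]


def correct_scores(scores, concept_classes, label):
    """ACC-style correction (assumed rule): keep a score only if it is positive
    and its concept is associated with the sample's true class; otherwise use 0."""
    return [
        s if (s > 0 and cls == label) else 0.0
        for s, cls in zip(scores, concept_classes)
    ]


# Toy 2-D embeddings standing in for real sentence embeddings (hypothetical values).
text_emb = [1.0, 0.0]
concept_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
concept_classes = [0, 0, 0, 1]  # class each concept was generated for (hypothetical)

scores = concept_scores(text_emb, concept_embs)
corrected = correct_scores(scores, concept_classes, label=0)
print(corrected)
```

In this toy case only the first concept survives correction: the second has zero similarity, the third is negative, and the fourth belongs to a different class than the label. Under the two-stage training described in the TL;DR, corrected scores of this kind would serve as targets for the Concept Bottleneck Layer before the sparse predictor is fit on top.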
