Towards Achieving Concept Completeness for Textual Concept Bottleneck Models
Milan Bhan, Yann Choho, Pierre Moreau, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot
TL;DR
The paper addresses interpretability in NLP, specifically the need for complete, reliable concept bases in textual concept bottleneck models (TCBMs). It introduces CT-CBM, a four-step framework that (1) builds micro and macro concept banks without supervision using a small language model, (2) scores concepts using concept activation vectors and identifiability measures, (3) initializes a diverse, high-coverage concept bottleneck layer (CBL), and (4) trains simple and residual TCBMs with a stopping criterion that ensures concept completeness. CT-CBM achieves downstream performance on par with strong baselines while significantly reducing the number of concepts and greatly improving concept-detection accuracy, across both general and technical domains. The work also demonstrates practical benefits, including concept-level intervention, analysis of adversarial and counterfactual explanations, and global interpretability. Overall, CT-CBM offers a principled, unsupervised, scalable, and reproducible route to complete, interpretable NLP classifiers, with tangible impact on reliability and transparency.
Abstract
Textual Concept Bottleneck Models (TCBMs) are interpretable-by-design models for text classification that predict a set of salient concepts before making the final prediction. This paper proposes the Complete Textual Concept Bottleneck Model (CT-CBM), a novel TCBM generator that builds concept labels in a fully unsupervised manner using a small language model, eliminating the need for both predefined human-labeled concepts and LLM annotations. CT-CBM iteratively targets important and identifiable concepts and adds them to the bottleneck layer to create a complete concept basis. CT-CBM achieves strong results relative to competing methods in terms of concept basis completeness and concept detection accuracy, offering a promising solution to reliably enhance the interpretability of NLP classifiers.
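To make the "concepts first, label second" structure concrete, here is a minimal NumPy sketch of a generic concept bottleneck forward pass. It is an illustration only, not the CT-CBM implementation: the function names, dimensions, and random weights are all assumptions, and the text embedding h stands in for whatever representation a language model backbone would produce.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tcbm_forward(h, W_concept, b_concept, W_label, b_label):
    """Generic concept bottleneck forward pass (hypothetical helper).

    h: (batch, d) text embeddings, assumed to come from some backbone.
    Returns per-concept activations and class probabilities.
    """
    # Bottleneck layer: one activation in (0, 1) per concept.
    concepts = sigmoid(h @ W_concept + b_concept)
    # The final prediction depends on the input only through the concepts.
    probs = softmax(concepts @ W_label + b_label)
    return concepts, probs

# Placeholder sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))          # 2 texts, 8-dimensional embeddings
W_concept = rng.normal(size=(8, 3))  # 3 concepts in the bottleneck
b_concept = np.zeros(3)
W_label = rng.normal(size=(3, 4))    # 4 target classes
b_label = np.zeros(4)

concepts, probs = tcbm_forward(h, W_concept, b_concept, W_label, b_label)
```

Because the classifier reads only the concept activations, overwriting an entry of concepts and recomputing the softmax is one simple way to picture a concept-level intervention.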
