B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Yifan Wang; Sukrut Rao; Ji-Ung Lee; Mayank Jobanputra; Vera Demberg

TL;DR

The paper introduces B-cos LMs, a bias-free, dynamically linear transformer variant that improves explainability by enforcing input-weight alignment through a tunable alignment pressure B. By combining B-cos conversion with task fine-tuning, the authors transform pretrained language models into explainable LMs that provide faithful, human-interpretable rationales while maintaining strong task performance and fast convergence. Extensive automatic and human evaluations demonstrate that B-cos explanations surpass post-hoc methods in faithfulness and interpretability across text classification tasks, with additional insights into how alignment pressure shapes learning and biases. The work also extends B-cosification to decoder-only models for generation tasks and offers practical guidelines for applying B-cos LMs in NLP pipelines, highlighting potential for broader transparency in language technologies.

Abstract

Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability through an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human-interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
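The abstract describes the B-cos idea only in words (no bias terms, input-weight alignment, alignment pressure B). As a rough illustration, the sketch below assumes the B-cos transform as it is defined in the earlier B-cos networks work for vision models: a unit computes |cos(x, w)|^(B-1) times the dot product of x with the unit-norm weight. The function names `bcos_unit` and `bcos_contributions` are hypothetical and are not taken from the authors' repository. The sketch covers a single unit, not a full B-cos LM.

```python
import math


def _effective_weights(x, w, b=2.0, eps=1e-12):
    """Input-dependent weights w_eff(x) = |cos(x, w)|^(b-1) * w_hat (assumed B-cos form)."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    w_hat = [wi / (w_norm + eps) for wi in w]            # unit-norm weight, no bias term
    linear = sum(wh * xi for wh, xi in zip(w_hat, x))    # w_hat^T x
    cos = linear / (x_norm + eps)                        # cosine between x and w
    scale = abs(cos) ** (b - 1.0)                        # alignment pressure B damps misaligned inputs
    return [scale * wh for wh in w_hat]


def bcos_unit(x, w, b=2.0):
    """Output of one B-cos unit: w_eff(x)^T x, i.e. linear in x once w_eff(x) is fixed."""
    w_eff = _effective_weights(x, w, b)
    return sum(we * xi for we, xi in zip(w_eff, x))


def bcos_contributions(x, w, b=2.0):
    """Per-input contributions w_eff_i * x_i; they sum exactly to the unit's output."""
    w_eff = _effective_weights(x, w, b)
    return [we * xi for we, xi in zip(w_eff, x)]
```

With `x = [3.0, 4.0]` and `w = [1.0, 0.0]`, the cosine is 0.6, so `bcos_unit(x, w, b=2.0)` is approximately 0.6 * 3 = 1.8, while `b=1.0` reduces to the plain bias-free linear response of about 3. Raising B shrinks the output of poorly aligned inputs further, which is the alignment pressure the summary refers to. Because the output is a sum of per-input contributions, those contributions can serve directly as an explanation of the unit's output.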
