B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg

TL;DR

The paper introduces B-cos LMs, a bias-free, dynamically linear transformer variant that improves explainability by enforcing input-weight alignment through a tunable alignment pressure B. By combining B-cos conversion with task fine-tuning, the authors transform pretrained language models into explainable LMs that provide faithful, human-interpretable rationales while maintaining strong task performance and fast convergence. Extensive automatic and human evaluations demonstrate that B-cos explanations surpass post-hoc methods in faithfulness and interpretability across text classification tasks, with additional insights into how alignment pressure shapes learning and biases. The work also extends B-cosification to decoder-only models for generation tasks and offers practical guidelines for applying B-cos LMs in NLP pipelines, highlighting potential for broader transparency in language technologies.

Abstract

Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
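The abstract describes B-cos networks as bias-free and as promoting input-weight alignment, but this section does not spell out the transform itself. The following is a minimal sketch of a single B-cos linear unit, assuming the formulation commonly given for B-cos networks: the output is $|\cos(\mathbf{x}, \hat{\mathbf{w}})|^{B-1} \, \hat{\mathbf{w}}^\top \mathbf{x}$ with unit-norm weights $\hat{\mathbf{w}}$. The function names `bcos_linear` and `token_contributions` are hypothetical and are not taken from the authors' repository; the full B-cos LM applies further architectural changes that are not shown here.

```python
import numpy as np

def bcos_linear(x, W, B=2.0, eps=1e-12):
    """Illustrative B-cos transform for one input vector.

    x: input vector, shape (d_in,)
    W: weight matrix, shape (d_out, d_in); there is no bias term
    B: alignment pressure (B = 1 reduces to a plain linear map with unit-norm rows)
    Returns the output vector and the input-dependent matrix W(x).
    """
    # Normalize each weight row to unit norm (assumed, as in B-cos networks).
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    lin = W_hat @ x                           # w_hat^T x for every output unit
    cos = lin / (np.linalg.norm(x) + eps)     # cosine similarity per output unit
    scale = np.abs(cos) ** (B - 1.0)          # down-weights poorly aligned units
    W_dyn = scale[:, None] * W_hat            # dynamic linear map W(x)
    return scale * lin, W_dyn

def token_contributions(x, W_dyn, unit):
    """Per-dimension contributions to one output unit; they sum to that output."""
    return W_dyn[unit] * x

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(3, 5))
out, W_dyn = bcos_linear(x, W, B=2.0)
print(np.allclose(out, W_dyn @ x))  # the layer is exactly linear given W(x)
```

Because there is no bias term, the output can be written exactly as $\mathbf{W(x)x}$, so the element-wise products returned by `token_contributions` account for the whole output of the chosen unit. Larger values of B shrink the output of units whose weights are poorly aligned with the input, which is the alignment pressure referred to in the TL;DR.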

Paper Structure

This paper contains 68 sections, 6 equations, 16 figures, 15 tables.

Figures (16)

  • Figure 1: Visualization of $\mathbf{W(x)x}$ in a conventionally fine-tuned model (Conventional LM) and a B-cos LM. Green (red) indicates the positive (negative) impact of tokens on the prediction. In both examples, both models correctly predict not toxic. In the Conventional LM, "funny" is incorrectly assigned a negative attribution in example (a), and in example (b), irrelevant words like "why" and "smell" are highlighted, making the explanations unfaithful and less interpretable. Examples and explanations are drawn from HateXplain. See the methodology section for details on how $\mathbf{W(x)x}$ is computed.
  • Figure 2: Mean accuracy of conventionally fine-tuned, Saloss and B-cos BERT averaged over three runs. B-cos models perform comparably to conventional models on most tasks.
  • Figure 3: Human evaluation reveals that B-cos explanations have better human interpretability and human agreement than baseline methods.
  • Figure 4: Examples of B-cos explanations (B-cos BERT) as well as ShapSampl and DecompX explanations (conv. BERT) from AG News. Green (red) indicates the positive (negative) impact of tokens on the prediction. The B-cos explanation highlights only relevant tokens and is more interpretable to humans.
  • Figure 5: Varying B for B-cos BERT (HateXplain). Accuracy and Comp both peak around B=1.5, while explanation entropy negatively correlates with B.
  • ...and 11 more figures