Interpreto: An Explainability Library for Transformers

Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Céline Hudelot, Fanny Jourdan

TL;DR

Interpreto addresses the need for practical, unified explainability tools for HuggingFace NLP models by integrating both attribution and concept-based analyses in a single library. The framework provides an attribution module with eleven methods and four metrics, plus a concept-based module that enables unsupervised concept discovery via dictionary learning and multiple interpretation/importance-estimation tools. It emphasizes usability, reproducibility, and extensibility, including extensive documentation and tutorials, and plans to broaden capabilities through supervised concepts, multimodal support, and deeper integration between attribution and concept pipelines. The work aims to make mechanistic interpretability and post-hoc explanations more accessible to researchers and practitioners, accelerating robust evaluation and debugging of transformer-based models.

Abstract

Interpreto is a Python library for post-hoc explainability of HuggingFace text models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source and can be installed with pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
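To make the idea of a token-level attribution concrete, here is a minimal sketch of occlusion, one of the simplest perturbation-based attribution techniques. It does not use Interpreto's API and is not one of the library's implementations: the lexicon-based scorer, the function names, and the "[MASK]" placeholder are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lexicon standing in for a real classifier's "positive" logit.
POSITIVE = {"great": 2.0, "good": 1.0, "enjoyable": 1.5}
NEGATIVE = {"boring": 2.0, "bad": 1.0}


def toy_positive_score(tokens: List[str]) -> float:
    """Toy 'positive review' score: positive weights minus negative weights."""
    score = 0.0
    for tok in tokens:
        key = tok.lower()
        score += POSITIVE.get(key, 0.0) - NEGATIVE.get(key, 0.0)
    return score


def occlusion_attributions(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",  # assumed placeholder, unknown to the toy scorer
) -> List[Tuple[str, float]]:
    """Attribute to each token the score drop observed when it is masked."""
    baseline = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append((tok, baseline - score_fn(occluded)))
    return attributions


tokens = "This movie was great".split()
attrs = occlusion_attributions(tokens, toy_positive_score)
# Pick the token whose removal changes the score the most.
top_token, top_value = max(attrs, key=lambda pair: abs(pair[1]))
print(top_token, top_value)  # great 2.0
```

With a real model, the scorer would be replaced by a forward pass returning the logit or probability of the explained class. For generation, the same loop would be repeated once per output token, which is why each generated token gets its own attribution.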

Paper Structure

This paper contains 31 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Examples of code and output for attribution explanations. a) The model classifies the text as a positive review; LIME attributes this mainly to the presence of the word "great". b) In generation, each output token is a prediction, so each one has its own attribution. Here, the output word "conference" is selected (by clicking on it in the interactive explanation); its attribution shows that the words "workshop" and "an" were the most important for predicting it.
  • Figure 2: Post-hoc, unsupervised concept-based pipeline. See Section \ref{cpt_pipeline}.
  • Figure 3: Code and output for concept-based explanations on a Qwen3-0.6B (Yang et al., 2025) generation model. Concepts are learned from activations computed on 100 AG News (Zhang et al., 2015) samples. The full pipeline runs in under 3 minutes on an RTX 3080 (10 GB).
  • Figure 4: Global concept-based explanations for a DistilBERT (Sanh et al., 2019) classifier fine-tuned on AG News (Zhang et al., 2015). Extracted from the classification concept tutorial: https://for-sight-ai.github.io/interpreto/notebooks/classification_concept_tutorial/.