Interpreto: An Explainability Library for Transformers
Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Céline Hudelot, Fanny Jourdan
TL;DR
Interpreto addresses the need for practical, unified explainability tools for HuggingFace NLP models by integrating both attribution and concept-based analyses in a single library. The framework provides an attribution module with eleven methods and four metrics, plus a concept-based module that enables unsupervised concept discovery via dictionary learning and multiple interpretation/importance-estimation tools. It emphasizes usability, reproducibility, and extensibility, including extensive documentation and tutorials, and plans to broaden capabilities through supervised concepts, multimodal support, and deeper integration between attribution and concept pipelines. The work aims to make mechanistic interpretability and post-hoc explanations more accessible to researchers and practitioners, accelerating robust evaluation and debugging of transformer-based models.
Abstract
Interpreto is a Python library for post-hoc explainability of HuggingFace text models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, with the aim of making explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source and can be installed with pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
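To make the notion of a feature-level attribution concrete, the following is a minimal sketch of occlusion, a standard perturbation-based attribution technique: each token is masked in turn and its score is the resulting drop in the model output. This is not Interpreto's API. The function names, the mask token, and the toy scoring function standing in for a HuggingFace classifier are all assumptions made for illustration; the library's documentation describes its actual interface.

```python
from typing import Callable, List


def occlusion_attributions(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Attribute a score to each token as the output drop when it is masked."""
    baseline = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        # Replace only the i-th token and re-score the perturbed input.
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(baseline - score_fn(occluded))
    return attributions


def toy_sentiment_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a classifier logit: a sum of word weights."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    return sum(weights.get(token.lower(), 0.0) for token in tokens)


tokens = ["The", "movie", "was", "great", "but", "boring"]
scores = occlusion_attributions(tokens, toy_sentiment_score)
for token, score in zip(tokens, scores):
    print(f"{token:>8}: {score:+.2f}")
```

In this toy setting, "great" receives a positive attribution and "boring" a negative one, while the remaining tokens receive zero. With a real model, score_fn would wrap tokenization and a forward pass, returning the logit or probability of the class being explained.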
