Combining Transformers with Natural Language Explanations
Federico Ruggeri, Marco Lippi, Paolo Torroni
TL;DR
The paper addresses transformer interpretability by augmenting models with an external memory of natural language explanations that ground their predictions. It introduces two memory-augmented models, memBERT and memDistilBERT, which use memory sampling strategies and optional strong supervision to scale to larger memories while preserving performance. In experiments on unfairness detection in Terms of Service (ToS) clauses and on claim detection in the IBM2015 dataset, the approach produces meaningful explanations and often improves classification metrics, suggesting that interpretability need not come at the cost of predictive performance. The authors point to input-aware memory retrieval and generation tasks as future directions.
Abstract
Many NLP applications require models to be interpretable. However, many successful neural architectures, including transformers, still lack effective interpretation methods. A possible solution could rely on building explanations from domain knowledge, which is often available as plain, natural language text. We thus propose an extension to transformer models that makes use of external memories to store natural language explanations and use them to explain classification outputs. We conduct an experimental evaluation on two domains, legal text analysis and argument mining, to show that our approach can produce relevant explanations while retaining or even improving classification performance.
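The idea of consulting a memory of explanations can be illustrated with a minimal sketch. It is not the authors' implementation: the explanation texts are made-up placeholders, random unit vectors stand in for transformer sentence embeddings, and the helper names (`memory_lookup`, `sample_slots`), the dot-product attention, and the uniform sampling are all assumptions chosen for illustration. Strong supervision is not covered.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def memory_lookup(query, memory):
    """Attend over memory slots with dot-product similarity (an assumption).

    Returns the attention weights and the weighted sum of memory slots.
    """
    weights = softmax(memory @ query)
    summary = weights @ memory
    return weights, summary

def sample_slots(num_slots, k, rng):
    """Uniformly sample k slot indices without replacement.

    Uniform sampling is only one conceivable strategy for limiting memory size.
    """
    k = min(k, num_slots)
    return np.sort(rng.choice(num_slots, size=k, replace=False))

# Hypothetical explanations; in practice these would come from domain knowledge.
explanations = [
    "The provider may change the terms without notifying the user.",
    "The provider may terminate the contract without giving a reason.",
    "The user waives the right to take legal action.",
    "The provider excludes liability for any damage.",
]

rng = np.random.default_rng(0)
dim = 16

# Stand-in embeddings: random rows normalised to unit length.
memory = rng.normal(size=(len(explanations), dim))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

# Toy query: an input whose embedding coincides with slot 1, so that slot
# has the highest dot-product score among unit-norm rows.
query = memory[1].copy()

weights, summary = memory_lookup(query, memory)
top = int(np.argmax(weights))
print("Most attended explanation:", explanations[top])

# The same lookup restricted to a sampled subset of the memory.
sampled = sample_slots(len(explanations), k=3, rng=rng)
sub_weights, _ = memory_lookup(query, memory[sampled])
```

In a full model, the memory summary would typically be combined with the input representation before classification, and the attention weights would indicate which explanations to show alongside the prediction.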
