Interpretability for Time Series Transformers using A Concept Bottleneck Framework
Angela van Sprang, Erman Acar, Willem Zuidema
TL;DR
The paper introduces a concept bottleneck framework for time-series transformers that uses Centered Kernel Alignment (CKA) to align bottleneck representations with predefined interpretable concepts, aiming to improve mechanistic interpretability while largely preserving forecasting accuracy. The approach is validated on the Vanilla Transformer, Autoformer, and FEDformer using synthetic data and seven real-world datasets, with interpretability analyses (CKA scores and component visualizations) and an intervention experiment (activation patching) that tests the faithfulness of the interpretation. Results show that predictive performance is maintained, and in some cases improved, while internal representations become more clearly aligned with the concepts; the interventions support the causal relevance of the bottleneck components. Limitations include the need to predefine interpretable concepts and the added computational cost. Future work includes extending the concept set and exploring other modalities.
Abstract
Mechanistic interpretability focuses on reverse engineering the internal mechanisms learned by neural networks. We extend this focus and propose to mechanistically forward engineer models using our framework based on Concept Bottleneck Models. In the context of long-term time series forecasting, we modify the training objective to encourage a model to develop representations that are similar to predefined, interpretable concepts, with similarity measured by Centered Kernel Alignment (CKA). This steers the bottleneck components to learn the predefined concepts, while allowing other components to learn other, undefined concepts. We apply the framework to the Vanilla Transformer, Autoformer and FEDformer, and present an in-depth analysis on synthetic data and on a variety of benchmark datasets. We find that model performance remains mostly unaffected, while interpretability improves substantially. Additionally, we verify the interpretation of the bottleneck components with an intervention experiment using activation patching.
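
To make the modified training objective more concrete, the following is a minimal sketch in NumPy of how a CKA-based similarity term could be combined with a forecasting loss. It is an illustration under stated assumptions, not the authors' implementation: it uses the linear variant of CKA, and the function names `linear_cka` and `bottleneck_loss`, the weighting parameter `alpha`, and the convex combination of the two terms are all hypothetical choices that this section does not specify.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representations of the same samples.

    x has shape (n_samples, d_x) and y has shape (n_samples, d_y).
    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    # Center every feature column so constant offsets do not matter.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalised by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def bottleneck_loss(forecast, target, bottleneck, concept, alpha=0.3):
    """Hypothetical combined objective (assumption, not from the paper).

    Mixes the forecasting error with a CKA dissimilarity term that pulls
    the bottleneck representation towards a predefined concept.
    """
    mse = float(np.mean((forecast - target) ** 2))
    dissimilarity = 1.0 - linear_cka(bottleneck, concept)
    return (1.0 - alpha) * mse + alpha * dissimilarity
```

In this sketch, a larger `alpha` puts more weight on concept alignment and less on forecasting error. In an actual training loop the same computation would be written in an automatic differentiation framework so that the CKA term contributes gradients to the bottleneck components.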
