Interpretability for Time Series Transformers using A Concept Bottleneck Framework

Angela van Sprang; Erman Acar; Willem Zuidema

TL;DR

The paper introduces a concept bottleneck framework for time-series transformers that uses Centered Kernel Alignment to align bottleneck representations with predefined interpretable concepts, aiming to enhance mechanistic interpretability without sacrificing forecasting accuracy. The approach is validated across Vanilla Transformer, Autoformer, and FEDformer on synthetic and seven real-world datasets, with extensive interpretability analyses (CKA scores and component visualizations) and an intervention experiment (activation patching) demonstrating faithfulness. Results show improved or maintained predictive performance alongside clearer, concept-aligned internal representations, and they confirm the causal relevance of bottleneck components through interventions. Limitations include the need to predefine interpretable concepts and added computational cost, with future work extending the concept set and exploring other modalities.

Abstract

Mechanistic interpretability focuses on reverse engineering the internal mechanisms learned by neural networks. We extend our focus and propose to mechanistically forward engineer using our framework based on Concept Bottleneck Models. In the context of long-term time series forecasting, we modify the training objective to encourage a model to develop representations which are similar to predefined, interpretable concepts using Centered Kernel Alignment. This steers the bottleneck components to learn the predefined concepts, while allowing other components to learn other, undefined concepts. We apply the framework to the Vanilla Transformer, Autoformer and FEDformer, and present an in-depth analysis on synthetic data and on a variety of benchmark datasets. We find that the model performance remains mostly unaffected, while the model shows much improved interpretability. Additionally, we verify the interpretation of the bottleneck components with an intervention experiment using activation patching.
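The abstract's idea of steering a bottleneck component toward a predefined concept by adding a similarity term to the training objective can be illustrated with a minimal sketch. It assumes the linear variant of Centered Kernel Alignment and a simple weighted sum of a forecasting loss and a CKA-based term. The function names, the weighting parameter `alpha`, and the exact form of the combined loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: array of shape (n_samples, d_x), e.g. one bottleneck component.
    y: array of shape (n_samples, d_y), e.g. one concept representation.
    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    # Center every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def combined_loss(forecast_loss, cka_scores, alpha=0.5):
    """Hypothetical objective: forecasting loss plus a CKA similarity term.

    cka_scores holds one score per (bottleneck component, concept) pair.
    alpha is an assumed trade-off weight, not a value from the paper.
    """
    similarity_loss = 1.0 - float(np.mean(cka_scores))
    return (1.0 - alpha) * forecast_loss + alpha * similarity_loss
```

In an actual training loop the same computation would be written with differentiable tensor operations so that gradients flow into the bottleneck layer; NumPy is used here only to keep the sketch self-contained.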

Paper Structure (39 sections, 11 equations, 22 figures, 6 tables)

Introduction
Background and Related Work
Concept Bottleneck Models
Knowledge Transfer with Centered Kernel Alignment
Time Series Transformers
Method
Loss Function
Interpretable Concepts in the Bottleneck
Implementation Details
Experiments
Synthetic Data
Real-world data
Interpretability Analysis
CKA Analysis
Component Visualizations
...and 24 more sections

Figures (22)

Figure 1: Overview of the concept bottleneck framework. The bottleneck is one encoder layer which is trained to be similar to pre-defined, interpretable concepts. The residual stream around the bottleneck is removed, such that all information passes through the bottleneck.
Figure 2: Architecture of a transformer with a concept bottleneck in the attention mechanism (blue) or the FF network (red). Note that the residual connection is removed at the location of the bottleneck, so the residual stream is interrupted there. Visualisation inspired by Rai et al. (2024).
Figure 3: Forecast and CKA scores of the attention bottleneck Autoformer on synthetic data, where the three heads of each layer (vertically) are compared with the three concept vectors (horizontally).
Figure 4: CKA scores on different concepts for the encoder of the Vanilla Transformer without bottleneck and with FF bottleneck. Both models contain three heads per layer. The first component of layer 1 (lower row) of the attention bottleneck is trained to be similar to AR, and the second component (middle row) to the hour-of-day concept. The scores are calculated on three batches of size 32 from the electricity test data. Recall that CKA is defined on a scale from 0 to 1, where 1 denotes perfect similarity.
Figure 5: Forecasts from individual bottleneck components, obtained by masking the other components with zero, with one panel per component (FF bottleneck Autoformer on electricity data). The first half of the ground truth forms the input to the model. Note that the horizontal axes are the same across all panels, but the panel for the second component contains a grid of days instead of numbered hours. Three further panels show the forecast made by the surrogate model AR, the forecast of the entire layer (i.e., all components together), and the forecast of the final layer when only the third component is used in the bottleneck layer. Note the difference between the third-component panel and the final-layer panel, where we decode from the bottleneck and the final layer, respectively.
...and 17 more figures
