Does Transformer Interpretability Transfer to RNNs?

Gonçalo Paulo; Thomas Marshall; Nora Belrose

TL;DR

This work evaluates whether transformer interpretability techniques transfer to modern RNNs (Mamba and RWKV). It tests Contrastive Activation Addition, Tuned Lens, and Eliciting Latent Knowledge (ELK) probing, including Quirky-model scenarios, to assess steering, latent trajectory extraction, and latent knowledge retrieval in RNNs. The findings show these methods largely transfer to RNNs, with the compressed RNN state offering potential advantages for steering, and with probes revealing latent knowledge even when outputs are misleading. The results support applying interpretability tools to RNNs and point to future opportunities in leveraging internal states and mechanistic approaches for deeper understanding.

Abstract

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.
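To make the steering idea concrete, here is a minimal sketch of contrastive activation addition on toy data. It assumes the usual formulation, in which the steering vector is the mean activation on examples that show a behavior minus the mean activation on examples that do not, and in which that vector is added to an activation with a signed multiplier. The function names and the three-dimensional activations are hypothetical illustrations, not taken from the paper's code.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Mean activation on positive examples minus mean on negative examples."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, multiplier=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical values, 3 dimensions for readability).
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(positive_acts, negative_acts)

# A positive multiplier pushes toward the behavior, a negative one away from it.
steered = steer([0.5, 0.5, 0.5], vec, multiplier=2.0)
suppressed = steer([0.5, 0.5, 0.5], vec, multiplier=-2.0)
```

In a transformer the activation would come from the residual stream at a chosen layer. For an RNN, the same arithmetic could in principle be applied to the recurrent state instead, which is the kind of state-based steering the abstract alludes to.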

Paper Structure (20 sections, 4 equations, 12 figures, 6 tables)

Introduction
Architectures
  Mamba
  RWKV
Contrastive activation addition
  Methodology
  Steering with the activation vector
  Steering with the state
Tuned lens
  Logit lens
  Tuned lens
  Methodology and results
"Quirky" models
  Methodology
  Results
...and 5 more sections

Figures (12)

Figure 1: A single Mamba block, as depicted by Gu and Dao (2023). Green trapezoids are linear projections, $\sigma$ denotes the Swish activation, and $\bigotimes$ denotes multiplication.
Figure 3: Steering in Mamba 2.8b and BTLM 3b. We observe a somewhat smaller steering response on Mamba (panel a) than on BTLM (panel b) for a significant fraction of behaviors. The response for Sycophancy is very weak for both models. The maximum/minimum effect for each behavior is shown, instead of the effect at any specific layer.
Figure 4: Steering in RWKV-v5 7b and Llama 2 3b. The responses of RWKV-v5 (panel a) are smaller but less erratic than those of Llama 2 (panel b), which shows larger effects but a non-monotonic response to steering. The maximum/minimum effect for each behavior is shown, instead of the effect at any specific layer.
Figure 5: The effects of steering via the residual stream and via the internal state in Mamba and RWKV-v5 are not additive. For all behaviors, the sum of the two individual steering effects is larger than the effect of applying both at the same time. In the case of Mamba, the Survival Instinct behavior is very irregular, and steering with both the state and the residual stream slightly decreases the response.
Figure 6: Comparison between the logit lens and the tuned lens for three architectures. The right-hand panel shows the perplexity of the logit lens for similarly sized models from two RNN architectures and a transformer across model depth, computed as the layer number divided by the total number of layers. The left-hand panel shows the perplexity of the tuned lens for the same model sizes and architectures.
...and 7 more figures
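The comparison in the Figure 6 caption can be sketched on a toy example. The logit lens decodes an intermediate hidden state directly with the model's unembedding matrix, while the tuned lens first passes the hidden state through a per-layer affine translator. This is a minimal sketch under stated assumptions: the final normalization layer is omitted, the translator is hand-picked here rather than trained, and every name and number below is hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed):
    """Decode an intermediate hidden state directly with the unembedding."""
    return softmax(matvec(unembed, hidden))


def tuned_lens(hidden, unembed, translator, bias):
    """Apply an affine translator to the hidden state, then decode it."""
    translated = [t + b for t, b in zip(matvec(translator, hidden), bias)]
    return softmax(matvec(unembed, translated))


def perplexity(probs_per_token, targets):
    """Exponential of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p[t]) for p, t in zip(probs_per_token, targets)]
    return math.exp(sum(nll) / len(nll))


# Toy setup: vocabulary of 2 tokens, hidden size 2, identity unembedding.
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden = [0.0, 1.0]                    # intermediate-layer state (hypothetical)
translator = [[0.0, 1.0], [1.0, 0.0]]  # hand-picked stand-in for a learned map
bias = [0.0, 0.0]
target = 0                             # index of the correct next token

logit_ppl = perplexity([logit_lens(hidden, unembed)], [target])
tuned_ppl = perplexity([tuned_lens(hidden, unembed, translator, bias)], [target])
```

In this toy case the intermediate state is expressed in a different basis from the one the unembedding expects, so the translator lowers the perplexity. A lower perplexity at a given depth means the lens prediction is closer to the target tokens at that layer.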
