Does Transformer Interpretability Transfer to RNNs?
Gonçalo Paulo, Thomas Marshall, Nora Belrose
TL;DR
This work evaluates whether transformer interpretability techniques transfer to modern RNNs (Mamba and RWKV). It tests three methods: Contrastive Activation Addition for steering model outputs, the Tuned Lens for eliciting latent predictions, and Eliciting Latent Knowledge (ELK) probing on "quirky" models fine-tuned to produce false outputs under certain conditions. Most of these methods transfer to RNNs; the compressed RNN state can be used to improve some of them, steering in particular, and probes recover latent knowledge even when the model's outputs are misleading. The results support applying existing interpretability tools to RNNs and suggest that future work could make further use of the recurrent state and of mechanistic approaches.
Abstract
Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine whether selected interpretability methods originally designed for transformer language models transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and on eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and that some of them can be improved by taking advantage of RNNs' compressed state.
