InversionView: A General-Purpose Method for Reading Information from Neural Activations

Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn

TL;DR

InversionView reads out the information encoded in a neural activation by inspecting the subset of inputs that give rise to similar activations. It samples these inputs from a trained decoder model conditioned on the activation, which helps uncover the information content of activation vectors and facilitates understanding of the algorithms implemented by transformer models.

Abstract

The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we show that InversionView can reveal clear information contained in activations, including basic information about tokens appearing in the context, as well as more complex information, such as the count of certain tokens, their relative positions, and abstract knowledge about the subject. We also provide causally verified circuits to confirm the decoded information.
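To make the idea of "the subset of inputs that give rise to similar activations" concrete, here is a minimal sketch of splitting candidate inputs into those inside and outside an $\epsilon$-preimage of a query activation. It rests on several assumptions that do not come from the abstract: the distance is taken to be Euclidean distance normalized by the norm of the query activation, the candidates are a fixed list rather than decoder samples, and `toy_activation`, `normalized_distance`, and `epsilon_preimage` are hypothetical names used only for illustration.

```python
import numpy as np


def normalized_distance(z, z_query):
    """Euclidean distance to the query, scaled by the query norm (an assumed metric)."""
    return float(np.linalg.norm(z - z_query) / np.linalg.norm(z_query))


def epsilon_preimage(candidates, activation_fn, z_query, epsilon=0.1):
    """Split candidate inputs by whether their activation lies within epsilon of z_query."""
    inside, outside = [], []
    for x in candidates:
        d = normalized_distance(activation_fn(x), z_query)
        (inside if d <= epsilon else outside).append((x, d))
    return inside, outside


def toy_activation(text):
    """Hypothetical stand-in for a probed model: encodes only the count of 'g'."""
    return np.array([float(text.count("g")), 1.0])


# In InversionView the candidates would be sampled from the trained decoder;
# here they are hard-coded to keep the sketch self-contained.
z_q = toy_activation("gag")
inside, outside = epsilon_preimage(["gg", "g", "xggx"], toy_activation, z_q)
print("inside:", inside)
print("outside:", outside)
```

In this toy setting, the inputs that land inside the preimage share the count of the target character but differ in everything else, which is the kind of pattern one would read off to infer what the activation encodes.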

Paper Structure (70 sections, 5 equations, 45 figures, 4 tables)

Figures (45)

  • Figure 1: Illustration of the geometry at two different activation sites, encoding different information about the input. Top: the semantics of being on leave are encoded. Bottom: the information that the subject of the input sentence is John is encoded.
  • Figure 2: (a) The probed model is trained with a language modeling objective. (b) Given a trained probed model, we first cache the internal activations $\mathbf{z}$ together with their corresponding inputs and activation site indices (omitted in the figure for brevity), then use them to train the decoder. The decoder is trained with a language modeling objective while being able to attend to $\mathbf{z}$. (c) When interpreting a specific query activation $\mathbf{z^q}$, we give it to the decoder, which generates possible inputs auto-regressively. We then evaluate the distances of the generated inputs on the original probed model.
  • Figure 3: InversionView on the Character Counting Task. The model counts how often the target character (after '|') occurs in the prefix (before '|'). B and E denote beginning and end of sequence tokens. The query activation conditions the decoder to generate samples capturing its information content. We show non-cherrypicked samples inside and outside the $\epsilon$-preimage ($\epsilon = 0.1$) at three activation sites on the same query input. Distance for each sample is calculated between activations corresponding to the parenthesized characters in the query input and the sample. "True count" indicates the correct count of the target character in the samples (the decoder may generate incorrect counts). (a) The MLP layer amplifies count information. Comparing the distances before (left) and after (right) the MLP, we see that samples with diverging counts become much more distant from the query activation. (b) In the next layer (":" exclusively attends to the target character -- copying information from the residual stream of the target character to the residual stream of ":"), the count is retained but the identity of the target character is no longer encoded ("c", "m", etc. instead of "g"), as it is no longer relevant for predicting the count. Therefore, observing the generations informs us of the activations' content and how it changes across activation sites.
  • Figure 4: (a) Character Counting. Activation patching results show that $a^{0,0}_{tc}$ and $a^{1,0}_:$ play crucial roles in prediction, as hypothesized based on Figure \ref{fig:char-count} and Sec. \ref{sec:3-digit-addition}. In contrast examples, only one character differs. Top: We patch activations cumulatively from left to right. Patching $a^{0,0}_{tc}$ accounts for the whole effect, and when $a^{0,0}_{tc}$ is already patched, patching $a^{1,0}_:$ has almost no effect. Bottom: Conversely, if we patch cumulatively from right to left, $a^{1,0}_:$ accounts for the whole effect, while patching $a^{0,0}_{tc}$ has no effect if $a^{1,0}_:$ has been patched. This verifies that $a^{1,0}_:$ relies solely on $a^{0,0}_{tc}$ and that this path is the one by which the model performs precise counting. The patching effect is averaged across the whole test set. (b) IOI. InversionView applied to Name Mover Head 9.9 at "to"; we fix the compared position to "to". Throughout the $\epsilon$-preimage, "Justin" appears as the IO, revealing that the head encodes this name. This interpretation is confirmed across query inputs.
  • Figure 5: InversionView applied to 3-digit addition: Visually inspecting sample inputs inside and outside the $\epsilon$-preimage of the query allows us to understand what information is contained in an activation. The color on each token in generated samples denotes the difference in the token's likelihood between a conditional and an unconditional decoder (Appendix \ref{ap:decoder-likelihood-diff}). The shade thus denotes how much the generation of the token is caused by the query activation (darker shade means a stronger dependence). In (a--c), the colored tokens are most relevant to the interpretation. We interpret two attention heads (a,b) and the output of the corresponding residual stream after attention (c). In (a), what is common throughout the $\epsilon$-preimage is that the digits in the hundreds places are 6 and 8. Inputs outside the $\epsilon$-preimage do not have this property. In (b), what is common is that the digits in the tens places are 1, 6, or numerically close. Hence, we can infer that the activation sites $a^{0,0}$ and $a^{0,3}$ encode the hundreds and tens places of the input operands respectively; the latter is needed to provide the carry to A1. Also, the samples show that the activations encode commutativity, since the digits at the hundreds and tens places are swapped between the two operands. In (c), the output of the attention layer after the residual connection, combining information from the sites in (a) and (b), encodes "6" and "8" in the hundreds place and the carry from the tens place. Note that $a^{0,1}$ and $a^{0,2}$ contain information similar to that in $a^{0,0}$. These observations are confirmed across inputs. Taken together, InversionView reveals how information is aggregated and passed on by different model components.
  • ...and 40 more figures
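
The cumulative activation patching described in the Figure 4 caption can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two-site toy model `run`, the site names `"a0"` and `"a1"`, and the use of the raw output difference as the patching effect are hypothetical and are not the models or metric used in the paper.

```python
def run(x, patched=None):
    """Toy model with two activation sites; 'a1' reads only from 'a0'.

    Returns the output and a cache of the activations at each site.
    Any site listed in `patched` is overwritten with the given value.
    """
    patched = patched or {}
    cache = {}
    cache["a0"] = patched.get("a0", float(x.count("g")))
    cache["a1"] = patched.get("a1", 2.0 * cache["a0"])
    return cache["a1"], cache


def cumulative_patching(clean, contrast, order):
    """Patch contrast activations into the clean run one site at a time, cumulatively."""
    clean_out, _ = run(clean)
    _, contrast_cache = run(contrast)
    patched, effects = {}, []
    for site in order:
        patched[site] = contrast_cache[site]  # earlier patches stay in place
        out, _ = run(clean, patched)
        effects.append((site, out - clean_out))
    return effects


# The contrast example differs from the clean one in a single character.
print(cumulative_patching("gg", "gx", ["a0", "a1"]))  # upstream site first
print(cumulative_patching("gg", "gx", ["a1", "a0"]))  # downstream site first
```

In this toy model the first patched site accounts for the full effect in either order and the second adds nothing, which is the signature one would expect when the downstream site depends solely on the upstream one.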