Discovering Chunks in Neural Embeddings for Interpretability

Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata

TL;DR

Discovering Chunks in Neural Embeddings for Interpretability presents a cognitively inspired framework (CNE) that interprets neural population activity as structured reflections of data through recurring chunks. The Reflection Hypothesis is tested across simple RNNs and large language models, using three extraction methods (Discrete Sequence Chunking, Neural Population Averaging, and Unsupervised Chunk Discovery) to identify interpretable chunks in embedding spaces of varying dimensionality. Empirical results show that chunks can causally influence predictions, enable compositional learning, and align with linguistic structure (e.g., POS tags) in LLMs, with unsupervised chunks capturing syntactic information. The work offers a scalable interpretability paradigm that reframes high-dimensional activations as assemblies of meaningful, recurring units, aiding transparency and debugging in neural systems.

Abstract

Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.
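To make the idea of a "dictionary of chunks" concrete, the following is a minimal sketch of one way recurring embedding states could be collected and matched. It is not the authors' procedure: the function names `average_chunk_states` and `match_chunks`, the per-token averaging, the Euclidean distance, the tolerance `tol`, and the synthetic hidden states are all assumptions made for illustration.

```python
import numpy as np


def average_chunk_states(hidden, tokens, min_count=2):
    """Return {token: mean hidden state} for tokens seen at least min_count times.

    hidden: array of shape (T, d), one hidden state per time step (assumed layout).
    tokens: length-T sequence of input symbols aligned with `hidden`.
    """
    hidden = np.asarray(hidden, dtype=float)
    tokens = np.asarray(tokens)
    dictionary = {}
    for tok in np.unique(tokens):
        mask = tokens == tok
        if mask.sum() >= min_count:
            # Prototype = average population activity over all occurrences.
            dictionary[str(tok)] = hidden[mask].mean(axis=0)
    return dictionary


def match_chunks(hidden, dictionary, tol):
    """Label each time step with its nearest prototype, or None if farther than tol."""
    hidden = np.asarray(hidden, dtype=float)
    names = list(dictionary)
    protos = np.stack([dictionary[name] for name in names])
    # Pairwise Euclidean distances, shape (T, number of prototypes).
    dists = np.linalg.norm(hidden[:, None, :] - protos[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return [names[j] if dists[i, j] <= tol else None for i, j in enumerate(nearest)]


# Synthetic stand-in for recorded hidden states (hypothetical data, not from the paper):
# each symbol evokes a fixed state plus small noise.
rng = np.random.default_rng(0)
tokens = list("ABCDEABCDE")
centers = {t: rng.normal(size=8) for t in "ABCDE"}
hidden = np.stack([centers[t] + 0.01 * rng.normal(size=8) for t in tokens])

dictionary = average_chunk_states(hidden, tokens)
labels = match_chunks(hidden, dictionary, tol=0.5)
```

In this toy setting, `labels` recovers the input symbol at each step from the hidden state alone, which is the sense in which population activity "reflects" the data. Real embeddings would need a choice of layer, distance, and tolerance that this sketch does not address.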

Paper Structure

This paper contains 31 sections, 11 equations, 21 figures, 1 table, 2 algorithms.

Figures (21)

  • Figure 1: (Top) Naturalistic data is highly redundant and compositional, e.g. in language sequences. Cognitive systems segment redundancies by chunking recurring patterns. The reflection hypothesis posits that ANNs' neural activities can be interpreted as chunks that reflect the structured regularities in reality. (Bottom) In simple networks that contain a small number of neurons, chunking methods can be used to learn a dictionary of frequently recurring population trajectories. The discrete representations of the embedding state can reliably predict the input in the sequence and the network's predictions.
  • Figure 2: Testing the reflection hypothesis with simple RNNs and artificial sequences. a. An RNN updates its predictions and memory states based on inputs and the previous hidden state. b. Neural population activity (of the first 5 neurons) in response to a repeating chunk (ABCD); c. Sparse occurrence of ABCD within a default sequence (E); d. ABCD persists as a cohesive chunk amid background noise (random E, F, G).
  • Figure 3: Left: Hidden states can be grafted to causally change network memory and prediction. Right: Embedding grafting enables faster transfer learning of a compositional vocabulary.
  • Figure 4: Left: Training creates extra chunks inside the embedding space. Right: The number of embedding states increases with the complexity of the input sequence.
  • Figure 5: a. Neural embedding activity of the first 50 neurons (unselected) across all layers (33) processing the prompt up until the end of each highlighted word. b. Extracted neural activity chunks in response to a word at different sequence positions.
  • ...and 16 more figures