Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata
TL;DR
This paper presents a cognition-inspired framework (CNE) that interprets neural population activity as a structured reflection of the data, built from recurring chunks. It tests the Reflection Hypothesis in simple RNNs and in large language models, using three extraction methods (Discrete Sequence Chunking, Neural Population Averaging, and Unsupervised Chunk Discovery) to identify interpretable chunks in embedding spaces of varying dimensionality. Empirically, chunks can causally influence predictions, enable compositional learning, and align with linguistic structure in LLMs (e.g., POS tags), with unsupervised chunks capturing syntactic information. The work offers a scalable interpretability paradigm that reframes high-dimensional activations as assemblies of meaningful, recurring units, aiding transparency and debugging in neural systems.
Abstract
Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.
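The idea of recurring hidden states that can be collected into a dictionary and then perturbed can be illustrated with a small toy example. The sketch below is not the authors' implementation: it assumes synthetic hidden states in which every occurrence of the token chunk "ABC" ends in a shared signature vector plus small noise, and it uses a plain Euclidean tolerance for detection. All names (`chunk_end_positions`, `chunk_prototype`, `detect_chunk`, `perturb`) and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def chunk_end_positions(tokens, chunk):
    """Return the index of the last token of every occurrence of `chunk`."""
    n = len(chunk)
    return [i + n - 1 for i in range(len(tokens) - n + 1)
            if list(tokens[i:i + n]) == list(chunk)]


def chunk_prototype(hidden, positions):
    """Average the hidden states at `positions` into one prototype vector."""
    if len(positions) == 0:
        raise ValueError("chunk never occurs, so there is nothing to average")
    return hidden[positions].mean(axis=0)


def detect_chunk(hidden, prototype, tol):
    """Return the time steps whose state lies within `tol` of the prototype."""
    dist = np.linalg.norm(hidden - prototype, axis=1)
    return np.flatnonzero(dist <= tol)


def perturb(state, prototype, alpha):
    """Move `state` toward the prototype (alpha > 0) or away from it (alpha < 0)."""
    return state + alpha * (prototype - state)


# Toy data (assumption): random states, except that every occurrence of "ABC"
# ends in a shared signature vector plus small noise.
rng = np.random.default_rng(0)
tokens = list("ABCDABCXABCABD")
hidden = rng.normal(size=(len(tokens), 16))
ends = chunk_end_positions(tokens, "ABC")
hidden[ends] = 5.0 + 0.01 * rng.normal(size=(len(ends), 16))

prototype = chunk_prototype(hidden, ends)          # one dictionary entry
found = detect_chunk(hidden, prototype, tol=1.0)   # where the state recurs
print(ends, found.tolist())  # prints [2, 6, 10] [2, 6, 10]
```

In this toy setting the averaged prototype recovers exactly the positions where the chunk ends. In a trained network, the hidden states would come from the model itself, and the effect of `perturb` would have to be judged by how the model's predictions change, which this sketch does not attempt.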
