Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures
Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
TL;DR
The paper tackles the opacity of LLM internal representations by proposing the Hyperdimensional Probe, a hybrid approach that combines Vector Symbolic Architectures (VSAs) with neural probing to jointly analyze input-driven features and output behavior. It introduces a three-stage pipeline: ingesting neural embeddings, mapping them to a bounded VSA proxy space with a neural VSA encoder, and extracting concepts via hypervector unbinding. The pipeline is validated on analogy and QA tasks across multiple LLMs. The work reports that VSA encodings faithfully capture latent features, enabling robust concept extraction and revealing rich, model-dependent conceptual structures that traditional logit-based methods often miss. By enabling joint input-output analysis and avoiding layer-specific dependencies, the approach offers a scalable, interpretable framework for probing neural representations, with practical implications for understanding reasoning and generation in LLMs. The results highlight both potential and limitations: they point to broader applicability across modalities and tasks, while acknowledging the difficulty of predefining a comprehensive concept alphabet and of validating causal links.
Abstract
Despite their capabilities, Large Language Models (LLMs) remain opaque, with limited understanding of their internal representations. Current interpretability methods either focus on input-oriented feature extraction, such as supervised probes and Sparse Autoencoders (SAEs), or on output distribution inspection, such as logit-oriented approaches. A full understanding of LLM vector spaces, however, requires integrating both perspectives, something existing approaches struggle with due to constraints on latent feature definitions. We introduce the Hyperdimensional Probe, a hybrid supervised probe that combines symbolic representations with neural probing. Leveraging Vector Symbolic Architectures (VSAs) and hypervector algebra, it unifies prior methods: the top-down interpretability of supervised probes, the sparsity-driven proxy space of SAEs, and output-oriented logit investigation. This allows deeper input-focused feature extraction while supporting output-oriented investigation. Our experiments show that our method consistently extracts meaningful concepts across LLMs, embedding sizes, and setups, uncovering concept-driven patterns in analogy-oriented inference and QA-focused text generation. By supporting joint input-output analysis, this work advances semantic understanding of neural representations while unifying the complementary perspectives of prior methods.
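To make the hypervector algebra mentioned above more concrete, the following is a minimal sketch of binding, bundling, and unbinding. It is not the authors' implementation. It assumes bipolar hypervectors with element-wise multiplication as the binding operator, which is one common VSA variant and may differ from the one used in the paper. The role names, concept names, dimensionality, and helper functions are all hypothetical choices made for illustration. In the paper, the hypervector to be decoded is produced by a neural VSA encoder from LLM embeddings; here it is constructed by hand so that the unbinding step can be shown in isolation.

```python
import numpy as np

DIM = 4096  # hypervector dimensionality (illustrative choice, not from the paper)
rng = np.random.default_rng(0)

def random_hv(dim=DIM):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return rng.choice(np.array([-1, 1]), size=dim)

def bind(a, b):
    """Bind two hypervectors by element-wise multiplication.

    For bipolar vectors this operation is its own inverse,
    so the same function also performs unbinding.
    """
    return a * b

def bundle(*hvs):
    """Superpose several hypervectors by element-wise addition."""
    return np.sum(hvs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_concept(query, codebook):
    """Return the codebook entry most similar to the query hypervector."""
    return max(codebook, key=lambda name: cosine(query, codebook[name]))

# Hypothetical concept alphabet: two roles and four filler concepts.
roles = {name: random_hv() for name in ("country", "currency")}
concepts = {name: random_hv() for name in ("Denmark", "krone", "Mexico", "peso")}

# A record that stores two role-filler pairs in a single hypervector.
record = bundle(
    bind(roles["country"], concepts["Denmark"]),
    bind(roles["currency"], concepts["krone"]),
)

# Unbinding with a role yields a noisy copy of its filler,
# which is then matched against the concept alphabet.
decoded = extract_concept(bind(record, roles["currency"]), concepts)
print(decoded)
```

Unbinding leaves the stored filler plus a noise term from the other pair. Because random hypervectors in high dimensions are nearly orthogonal, the nearest-neighbor lookup over the concept alphabet recovers the intended concept. This lookup also illustrates why a predefined concept alphabet is needed: only concepts present in the codebook can be extracted.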
