Aligned at the Start: Conceptual Groupings in LLM Embeddings
Mehrdad Khatir, Sanchit Kabra, Chandan K. Reddy
TL;DR
The paper investigates how base input embeddings in transformer-based LLMs encode conceptual structure prior to contextual processing. Its pipeline builds a fuzzy graph from $k$-nearest-neighbor ($k$-NN) relations among the embeddings and applies Louvain community detection to extract hierarchical concept clusters. The paper reports significant human-aligned categorization in the embedding space, within-cluster organization that includes a topological ordering of numbers, and moderate to high alignment of concepts across diverse models and architectures. A bias-mitigation case study shows that targeted cluster modification via embedding engineering can reduce ethnicity bias while preserving task performance, with practical implications for fairness and robustness. These findings offer a path toward interpretable, manipulable embedding-level representations and point to embedding engineering as a viable tool for safety and reliability in LLM applications.
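The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings, the function name `knn_fuzzy_graph`, the exponential membership weighting, and the probabilistic fuzzy union are all assumptions chosen for illustration, and Louvain is taken from `networkx` (version 2.7 or later is assumed).

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities


def knn_fuzzy_graph(emb, k=3):
    """Build an undirected weighted graph from cosine k-NN relations.

    Assumption: directed membership strengths decay exponentially with
    cosine distance and are symmetrized with a fuzzy union (a + b - a*b).
    """
    n = emb.shape[0]
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar rows

    w = np.zeros((n, n))
    for i in range(n):
        d = 1.0 - sim[i, nbrs[i]]             # cosine distances to the k neighbors
        rho = d.min()                         # distance to the closest neighbor
        sigma = max(d.mean() - rho, 1e-12)    # crude local scale (illustrative)
        w[i, nbrs[i]] = np.exp(-(d - rho) / sigma)

    w = w + w.T - w * w.T                     # fuzzy union of the two directions

    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    rows, cols = np.nonzero(np.triu(w, 1))
    for i, j in zip(rows, cols):
        graph.add_edge(int(i), int(j), weight=float(w[i, j]))
    return graph


# Toy stand-in for input embeddings: three well-separated groups of 10 vectors.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
centers = 5.0 * np.eye(8)[:3]
embeddings = centers[labels] + 0.1 * rng.standard_normal((30, 8))

graph = knn_fuzzy_graph(embeddings, k=3)
communities = louvain_communities(graph, weight="weight", seed=0)

for c in communities:
    print(sorted(c))
```

Because the toy groups are far apart, no edge crosses between groups, so every detected community contains vectors from a single group. To inspect coarser and finer levels of the hierarchy, `networkx` also provides `louvain_partitions`, which yields the partition found at each level.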
Abstract
This paper shifts focus to the often-overlooked input embeddings, the initial representations fed into transformer blocks. Using fuzzy graph construction, k-nearest neighbor (k-NN) search, and community detection, we analyze embeddings from diverse LLMs and find significant categorical community structure that aligns with predefined concepts and human-aligned categories. We observe that these groupings exhibit within-cluster organization (such as hierarchies and topological ordering), and we hypothesize a fundamental structure that precedes contextual processing. To further investigate the conceptual nature of these groupings, we compare input embeddings across different LLM categories and observe a medium to high degree of cross-model alignment. Furthermore, we provide evidence that manipulating these groupings can play a functional role in mitigating ethnicity bias in LLM tasks.
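One way to make the idea of cross-model alignment concrete is to compare the clusters two models assign to a shared vocabulary. The sketch below is an assumption for illustration only: the abstract does not state which alignment measure the paper uses, and the function name `best_match_alignment`, the Jaccard-based score, and the toy clusters are hypothetical.

```python
def best_match_alignment(clusters_a, clusters_b):
    """Mean, over clusters in A, of the best Jaccard overlap with any cluster in B.

    Illustrative measure only; it is asymmetric in its two arguments.
    """
    scores = []
    for a in clusters_a:
        a = set(a)
        best = max(len(a & set(b)) / len(a | set(b)) for b in clusters_b)
        scores.append(best)
    return sum(scores) / len(scores)


# Hypothetical clusters found in two models over the same eight tokens.
model_a = [{"one", "two", "three"}, {"red", "green", "blue"}, {"cat", "dog"}]
model_b = [{"one", "two", "three"}, {"red", "green"}, {"blue", "cat", "dog"}]

score = best_match_alignment(model_a, model_b)
print(round(score, 3))
```

Here the number cluster matches exactly (overlap 1), while the color and animal clusters each reach an overlap of 2/3, giving a mean of 7/9. A score of 1 would mean every cluster in the first model has an identical counterpart in the second.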
