Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework
Yukun Zhang, Qi Dong
TL;DR
MSMA is an information-geometric framework that decomposes LLM representations into three semantic manifolds (local $\mathcal{M}_L$, intermediate $\mathcal{M}_I$, and global $\mathcal{M}_G$) and learns cross-scale mappings that preserve geometry and information. The mappings $f_{GI}$ and $f_{IL}$ are formalized under three principles (geometric preservation, information fidelity, and curvature regularity) and trained by optimizing $\mathcal{L}_{\text{total}}=\lambda_{geo}\mathcal{L}_{geo}+\lambda_{info}\mathcal{L}_{info}+\lambda_{curv}\mathcal{L}_{curv}$, with mutual information estimated via MINE. Across GPT-2, BERT, RoBERTa, and T5, MSMA achieves near-perfect alignment (e.g., $99\%$ KL reduction and $5$–$7\times$ MI gains). Empirically, MSMA reveals a robust three-scale hierarchy: interventions at a specific scale alter lexical diversity, sentence structure, or discourse coherence, with cross-scale effects that depend on the architecture. This enables targeted control for bias mitigation and robust generation. By linking representational geometry to functional behavior across scales, the framework clarifies cross-scale information flow and offers a principled path toward scale-specific editing, controllable generation, and safety enhancements in transparent, trustworthy AI systems.
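To make the objective concrete, the following is a minimal sketch of how the weighted sum $\mathcal{L}_{\text{total}}$ might be assembled, together with the Donsker-Varadhan lower bound that MINE maximizes, $I(X;Z)\ge \mathbb{E}_{P_{XZ}}[T]-\log\mathbb{E}_{P_X\otimes P_Z}[e^{T}]$. Several pieces are assumptions made only for illustration: the function names, the fixed untrained critic `critic` (MINE would train a neural network here), the toy Gaussian data, the placeholder values for the geometric and curvature terms, and the choice of taking $\mathcal{L}_{info}$ as the negative MI estimate. None of these are specified in the text above.

```python
import math
import random


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate: mean(T on joint) - log mean(exp(T) on marginals)."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    # Subtract the max before exponentiating for numerical stability.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    return mean_joint - (m + math.log(mean_exp))


def total_loss(l_geo, l_info, l_curv, lam_geo=1.0, lam_info=1.0, lam_curv=1.0):
    """Weighted sum of the three loss terms, mirroring L_total."""
    return lam_geo * l_geo + lam_info * l_info + lam_curv * l_curv


def critic(x, z):
    # Hypothetical fixed critic; a real MINE setup would learn this function.
    return 0.5 * x * z


# Toy data (assumption): z is a noisy copy of x, so the pair shares information.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
zs = [x + 0.5 * rng.gauss(0.0, 1.0) for x in xs]

# Shuffling z breaks the pairing and approximates the product of marginals.
shuffled = zs[:]
rng.shuffle(shuffled)

mi_estimate = dv_lower_bound(
    [critic(x, z) for x, z in zip(xs, zs)],
    [critic(x, z) for x, z in zip(xs, shuffled)],
)

# Placeholder values (assumption) stand in for the geometric and curvature terms.
loss = total_loss(l_geo=0.2, l_info=-mi_estimate, l_curv=0.05, lam_info=0.5)
print(f"DV estimate: {mi_estimate:.3f}, total loss: {loss:.3f}")
```

Because the critic is fixed rather than trained, the printed estimate is only a loose lower bound. The sketch is meant to show how the pieces fit together, not to reproduce the reported numbers.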
Abstract
We present Multi-Scale Manifold Alignment (MSMA), an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds and learns cross-scale mappings that preserve geometry and information. Across GPT-2, BERT, RoBERTa, and T5, we observe consistent hierarchical patterns and find that MSMA improves alignment metrics under multiple estimators (e.g., relative KL reduction and MI gains that are statistically significant across seeds). Controlled interventions at different scales yield distinct, architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence. While our theoretical analysis relies on idealized assumptions, the empirical results suggest that multi-objective alignment offers a practical lens for analyzing cross-scale information flow and guiding representation-level control.
