Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson
TL;DR
This work presents a data-centric interpretability framework for analyzing LLM-based multi-agent reinforcement learning in Full-Press Diplomacy by combining Sparse Autoencoders with hierarchical LLM summarization. The central novelty is Meta-Autointerp, which aggregates SAE features into interpretable meta-features and generates hypotheses about training dynamics, complemented by an LLM-summarization pipeline that highlights high-level strategic shifts. Across two user studies and automated validation, SAE Meta-Features yield the highest proportion of significant and predictive hypotheses, while LLM-derived hypotheses contribute global insights; together they offer complementary viewpoints that guide downstream interventions. An intervention using hypothesis-guided prompts improves game performance by 14.2%, and case studies reveal both early warning signals for bad runs and unexpected reward hacking patterns, illustrating practical utility for trustworthy LLM behavior during training.
Abstract
Large language models (LLMs) are increasingly trained with reinforcement learning in complex multi-agent environments, which makes it difficult to understand how their behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs in the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and we find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features, along with most LLM-generated hypotheses, may be worse than useless to humans. Nevertheless, a subset of SAE-derived hypotheses is predictively useful for downstream tasks. We provide further validation by augmenting an untrained agent's system prompt, which improves the score by 14.2%. Overall, we show that SAEs and LLM summarizers provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
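To make the idea of a "significant" meta-feature concrete, the following is a minimal sketch of one way such a check could be run. It is not the procedure used in this work, which the abstract does not specify. It assumes a meta-feature is scored by averaging the activations of its member SAE features at each checkpoint, and that significance means a monotone trend over training, tested with a permutation test on Spearman's rank correlation. All function names, feature identifiers, and activation values are hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def meta_feature_score(feature_activations, member_ids):
    """Assumed aggregation: mean activation of member features per checkpoint."""
    return [mean(step[f] for f in member_ids) for step in feature_activations]


def trend_p_value(steps, scores, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a monotone trend of scores over steps."""
    rng = random.Random(seed)
    observed = abs(spearman(steps, scores))
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(steps, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-checkpoint mean activations for three SAE features.
steps = list(range(10))
activations = [
    {"f1": 0.01 * t, "f2": 0.02 * t + 0.1, "f3": 0.5} for t in steps
]
scores = meta_feature_score(activations, ["f1", "f2"])
p_value = trend_p_value(steps, scores)
```

A small `p_value` here would suggest that the grouped feature's activation drifts systematically over training rather than by chance. Real activations are noisier than this toy series, and testing many meta-features at once would call for a multiple-comparison correction.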
