Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson

TL;DR

This work presents a data-centric interpretability framework for analyzing LLM-based multi-agent reinforcement learning in Full-Press Diplomacy by combining Sparse Autoencoders with hierarchical LLM summarization. The central novelty is Meta-Autointerp, which aggregates SAE features into interpretable meta-features and generates hypotheses about training dynamics, complemented by an LLM-summarization pipeline that highlights high-level strategic shifts. Across two user studies and automated validation, SAE Meta-Features yield the highest proportion of significant and predictive hypotheses, while LLM-derived hypotheses contribute global insights; together they offer complementary viewpoints that guide downstream interventions. An intervention using hypothesis-guided prompts improves game performance by 14.2%, and case studies reveal both early warning signals for bad runs and unexpected reward hacking patterns, illustrating practical utility for trustworthy LLM behavior during training.

Abstract

Large language models (LLMs) are increasingly trained in complex multi-agent reinforcement learning environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors, including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and we find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features, along with most LLM-generated hypotheses, may be worse than useless to humans. Nevertheless, a subset of SAE-derived hypotheses is predictively useful for downstream tasks. We provide further validation by augmenting an untrained agent's system prompt, which improves the score by +14.2%. Overall, we show that SAEs and the LLM-summarizer provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
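
To make the idea of a feature that changes "significantly" over training concrete, here is a minimal sketch that flags features whose mean activation trends with the training step. It is not the paper's validation procedure, which this page does not specify. The Spearman rank correlation, the `threshold` value, the function names, and the toy numbers are all assumptions chosen for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no trend
    return cov / (var_x * var_y) ** 0.5


def trending_features(activations_by_step, threshold=0.8):
    """Map feature name -> rho for features with |rho| >= threshold.

    `threshold` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    flagged = {}
    for name, series in activations_by_step.items():
        rho = spearman(list(range(len(series))), series)
        if abs(rho) >= threshold:
            flagged[name] = rho
    return flagged


# Hypothetical per-step mean activations for two made-up features.
toy = {
    "role_play": [0.01, 0.02, 0.05, 0.09, 0.15, 0.22],
    "language_switch": [0.10, 0.08, 0.11, 0.09, 0.10, 0.09],
}
print(trending_features(toy))
```

On this toy input only the steadily rising series is flagged. A real analysis would also need a proper significance test and a correction for testing many features at once.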

Paper Structure

This paper contains 69 sections, 5 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of our framework. We generate hypotheses about training dynamics using LLM-summarizer and Meta-Autointerp over sparse autoencoder features. Hypotheses are validated using both automated LLM-based evaluations and human user studies.
  • Figure 2: Napoleon meta-feature training dynamics (top) and example activating spans (bottom).
  • Figure 3: Human evaluation of hypothesis predictive usefulness across 18 hypotheses. A majority of SAE-generated features improve early vs late classification, while LLM-generated hypotheses struggle to improve accuracy. Full results in Appendix \ref{apdx:user_study_2_table}.
  • Figure 4: Early representational divergence between successful and failed training runs. Steps 6-9 form the early warning window, in which SAE features signal the divergence while reward curves remain indistinguishable.
  • Figure 5: SAE features detected highly correlated reward hacking and related non-reward hacking behaviors, validated by regex matching.
  • ...and 7 more figures