Simplifying Outcomes of Language Model Component Analyses with ELIA

Aaron Louis Eidt, Nils Feldhus

TL;DR

ELIA is an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

Abstract

While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

Paper Structure

This paper contains 20 sections, 13 figures, and 2 tables.

Figures (13)

  • Figure 1: ELIA system overview, including three analysis methods (Attribution Analysis, Function Vectors, and Circuit Tracer) and the explanation generation workflow using VLMs to transform complex interpretability analyses into accessible NLEs. The system is evaluated using a Faithfulness Checker and a user study.
  • Figure 2: The interactive Attribution Heatmap using Integrated Gradients with an AI-generated natural language explanation. The heatmap visualizes the influence of input tokens on the generated output, and the explanation interprets these results in an accessible narrative.
  • Figure 3: Function Vector and Circuit Trace Analysis visualizations. The 3D PCA plot (left) places the user's prompt in a semantic functional space, while the Circuit Graph (right) traces the flow of information through interpretable features across layers. Both are accompanied by AI-generated explanations.
  • Figure 4: Impact of intervention on model output probability ($|\Delta p|$). We compare the effect of ablating top-$k$ targeted features and traced circuits against random baselines (ablating random features or edges).
  • Figure 5: Grouped boxplots of UX ratings for all participants across the three analysis pages.
  • ...and 8 more figures
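
As a rough illustration of the kind of attribution that underlies the heatmap in Figure 2, the sketch below approximates Integrated Gradients on a toy model. It is not ELIA's implementation: the logistic model, its weights `w`, the input `x`, the all-zero baseline, and the step count are all hypothetical choices made only so the example is self-contained. The path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy stand-in for a model output probability: sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def grad(x, w):
    """Analytic gradient of model(x, w) with respect to the input x."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w))]
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical values, chosen only for illustration.
w = [2.0, -1.0, 0.5]
x = [1.0, 0.5, -2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (model(x, w) - model(baseline, w)))
```

The completeness check is a convenient sanity test for any Integrated Gradients implementation: if the attributions do not sum to the difference between the model output at the input and at the baseline, the step count is too low or the gradient is wrong. In a real LLM setting the inputs would be token embeddings and the gradient would come from automatic differentiation rather than a closed-form expression.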