Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Anthony Costarelli; Mat Allen; Severin Field

TL;DR

This work investigates meta-models: an architecture in which a meta-model takes activations from an input-model and answers natural language questions about the input-model's behaviors. We show that meta-models generalize well to out-of-distribution tasks.

Abstract

As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need for faithfully interpreting their decision-making. While traditional probing methods have shown some effectiveness, they remain best suited to narrowly scoped tasks, and more comprehensive explanations are still necessary. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behaviors. We evaluate the meta-models' ability to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area. Our code is available at https://github.com/acostarelli/meta-models-public .
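
The data flow described in the abstract (activations from an input-model, plus a natural language question, go into a meta-model that returns an answer) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the names `MetaModelPipeline`, `toy_input_model_activations`, `toy_meta_model`, and `accuracy` are hypothetical, the "activations" are hand-made word counts rather than real LLM hidden states, and the "meta-model" is a hard-coded rule standing in for a trained LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

# Hypothetical word lists used only to fabricate toy "activations".
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


def toy_input_model_activations(prompt: str) -> Vector:
    """Stand-in for reading hidden activations from an input-model.

    Assumption: a real system would run an LLM on the prompt and read
    activations from it; here we simply count sentiment words.
    """
    words = prompt.lower().split()
    return [
        float(sum(w in POSITIVE for w in words)),
        float(sum(w in NEGATIVE for w in words)),
    ]


def toy_meta_model(activations: Vector, question: str) -> str:
    """Stand-in for a meta-model answering a question about activations.

    Assumption: a real meta-model would be a trained LLM; this rule only
    handles sentiment questions and answers "unknown" otherwise.
    """
    if "sentiment" in question.lower():
        return "positive" if activations[0] >= activations[1] else "negative"
    return "unknown"


@dataclass
class MetaModelPipeline:
    """Wires an activation extractor to a question-answering meta-model."""

    extract_activations: Callable[[str], Vector]
    answer: Callable[[Vector, str], str]

    def ask(self, prompt: str, question: str) -> str:
        # The meta-model sees the activations, not the prompt text itself.
        activations = self.extract_activations(prompt)
        return self.answer(activations, question)


def accuracy(
    pipeline: MetaModelPipeline,
    examples: Sequence[Tuple[str, str, str]],
) -> float:
    """Fraction of (prompt, question, expected answer) triples answered correctly."""
    correct = sum(pipeline.ask(p, q) == a for p, q, a in examples)
    return correct / len(examples)


pipeline = MetaModelPipeline(toy_input_model_activations, toy_meta_model)
held_out = [
    ("I love this great movie", "What is the sentiment?", "positive"),
    ("An awful bad film", "What is the sentiment?", "negative"),
]
print(accuracy(pipeline, held_out))
```

The `accuracy` helper mirrors, in miniature, the kind of held-out evaluation the abstract describes: the pipeline is scored on examples it was not fitted to. In this sketch nothing is trained, so the number says nothing about real meta-model performance.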

Paper Structure (18 sections, 2 figures)

Introduction
Contributions
Related Works
Mechanistic Interpretability
Automated Interpretability
Meta-Models
Activation Patching
Methodology
Dataset Utilization
Meta-Model Architecture
Empirical Findings
Generalization to lie-detection
Best and worst performance
Impact on performance from datasets
Performance on other models
...and 3 more sections

Figures (2)

Figure 1: Meta-model generalization accuracy on lie detection when trained on the datasets indicated. Crosses are averages of scores across two runs. Datasets are L: Language, E: Emotions, M: Multilingual, S: Sentiment. Datasets and results are discussed in depth in the Dataset Utilization and Empirical Findings sections, respectively.
Figure 2: InternLM as meta-model. Crosses are again averages of scores across two runs. The entire experimental setup used with meta-llama/Llama-3.1-8B-Instruct is recreated exactly here.
