TracrBench: Generating Interpretability Testbeds with Large Language Models

Hannes Thurnherr; Jérémy Scheurer

TL;DR

This work presents a novel approach for generating interpretability testbeds using large language models (LLMs) and introduces TracrBench, a dataset of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights.

Abstract

Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground-truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method that compiles RASP programs into transformers with inherent ground-truth mappings, has been proposed to address this issue. However, manually creating the large number of models needed to verify interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability testbeds using large language models (LLMs) and introduce TracrBench, a dataset of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, so the remaining programs had to be implemented manually. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.

Paper Structure (11 sections, 4 figures)

Introduction
Method
Dataset
Experiments
Results
Related Work
Conclusion
Author Contributions
Complete list of Algorithms
Example Program
Full Prompt

Figures (4)

Figure 1: Results on the test set of 101 Tracr programs, with pass rate on the left and a normalized, difficulty-weighted score on the right (the maximum score on both metrics is $1.0$). The 20-shot prompt with best-of-5 sampling achieves the best performance among the prompts tested. The gpt-4-turbo-2024-04-09 and gpt-4o models achieve the best performance overall. However, the task is challenging for all models.
Figure 2: The description of the target algorithm to be implemented, which forms part of the prompt given to the LLM.
Figure 3: We show the distribution of RASP function calls within TracrBench using Kernel Density Estimation. The plot shows that most programs have around 6 RASP function calls, while a smaller number of more complex programs form a long tail.
Figure 4: We compare the number of RASP functions and program lines as proxies for task difficulty. When plotting the pass-rate of gpt-4o-turbo on all programs, we can see that the number of RASP functions is a better indicator of task complexity than the total lines of code.
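
The two metrics named in Figure 1 can be illustrated with a minimal Python sketch. It rests on assumptions that are not confirmed by the text above: a candidate is treated as a plain Python callable (rather than a compiled Tracr model), a program counts as passed under best-of-n sampling if any one of the n candidates matches a reference implementation on every test input, and the difficulty weight is taken to be the number of RASP function calls (the proxy suggested by Figure 4). The helper names `best_of_n_pass`, `pass_rate`, and `weighted_score` are hypothetical, and the paper's exact normalization may differ.

```python
from typing import Callable, Sequence


def best_of_n_pass(
    candidates: Sequence[Callable[[list], list]],
    reference: Callable[[list], list],
    test_inputs: Sequence[list],
) -> bool:
    """Return True if any candidate agrees with the reference on all inputs."""
    for candidate in candidates:
        try:
            if all(candidate(x) == reference(x) for x in test_inputs):
                return True
        except Exception:
            # A candidate that crashes simply counts as a failed sample.
            continue
    return False


def pass_rate(passed: Sequence[bool]) -> float:
    """Fraction of programs that were implemented correctly."""
    return sum(passed) / len(passed)


def weighted_score(passed: Sequence[bool], difficulty: Sequence[float]) -> float:
    """Difficulty-weighted score, normalized so that passing everything gives 1.0.

    ASSUMPTION: difficulty is the RASP function count of each program.
    """
    total = sum(difficulty)
    return sum(d for ok, d in zip(passed, difficulty) if ok) / total


# Toy example: the target maps every position to the sequence length.
reference = lambda xs: [len(xs)] * len(xs)
samples = [
    lambda xs: xs,                      # wrong: identity
    lambda xs: [len(xs) for _ in xs],   # correct
]
solved = best_of_n_pass(samples, reference, [[1, 2, 3], [4]])
```

With the figures from the abstract, `pass_rate([True] * 57 + [False] * 44)` evaluates to 57/101, roughly 0.564. Under the assumed weighting, a model that solves only easy programs scores lower on `weighted_score` than on `pass_rate`, which is the distinction the right-hand panel of Figure 1 is meant to expose.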
