BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank
TL;DR
This work investigates whether ensembling circuit localization methods can improve performance on the Mechanistic Interpretability Benchmark (MIB). The authors introduce parallel and sequential ensembling, then combine them into a hybrid approach that achieves consistent improvements in CPR (a higher score indicating better circuit identification) and CMD (a lower score indicating closer alignment with model behavior). Key contributions include demonstrating gains from warm-started edge pruning and from integrating multiple EAP-based baselines, with the hybrid ensemble achieving top leaderboard performance on private data. The findings suggest that leveraging complementary localization signals yields more accurate and robust circuit identification, with practical implications for scalable mechanistic interpretability in large language models. The approach emphasizes reproducibility and provides public code to facilitate further research.
Abstract
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine the attribution scores that different methods assign to each edge, e.g., by averaging them or by taking their minimum or maximum. In sequential ensembling, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to more precise circuit identification. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
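The parallel ensembling step described in the abstract can be illustrated with a minimal sketch. It assumes each method produces one attribution score per edge, stored in arrays of identical shape and edge order. The function names `parallel_ensemble` and `top_k_edges`, the toy scores, and the choice to rank edges by raw score are all hypothetical illustrations, not the authors' implementation; in practice, scores from different methods may also need rescaling before they are combined.

```python
import numpy as np

def parallel_ensemble(score_sets, how="mean"):
    """Combine per-edge attribution scores from several methods.

    score_sets: list of equally shaped arrays, one per method
                (assumption: same edges in the same order).
    how: "mean", "min", or "max", the reductions named in the abstract.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_sets], axis=0)
    reducers = {"mean": np.mean, "min": np.min, "max": np.max}
    if how not in reducers:
        raise ValueError(f"unknown reduction: {how!r}")
    # Reduce across the method axis, leaving one score per edge.
    return reducers[how](stacked, axis=0)

def top_k_edges(scores, k):
    """Return indices of the k highest-scoring edges (hypothetical selection rule)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return order[:k]

# Toy scores for four edges from two hypothetical methods.
method_a = [0.9, 0.1, 0.4, 0.0]
method_b = [0.5, 0.3, 0.8, 0.2]

mean_scores = parallel_ensemble([method_a, method_b], how="mean")
min_scores = parallel_ensemble([method_a, method_b], how="min")
max_scores = parallel_ensemble([method_a, method_b], how="max")
circuit_edges = top_k_edges(mean_scores, k=2)
```

Under these assumptions, taking the minimum keeps an edge only when every method scores it highly, while taking the maximum keeps an edge that any single method scores highly; the mean sits between the two.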
