Towards Worst-Case Guarantees with Scale-Aware Interpretability

Lauren Greenspan, David Berman, Aryeh Brill, Ro Jefferson, Artemy Kolchinsky, Jennifer Lin, Andrew Mack, Anindita Maiti, Fernando E. Rosas, Alexander Stapleton, Lucas Teixeira, Dmitry Vaintrob

TL;DR

This work addresses the challenge of interpretability with worst-case guarantees by introducing scale-aware interpretability anchored in renormalisation theory. It formalizes implicit and explicit RG-style strategies for neural networks, proposing a framework to identify model-natural scales, effective degrees of freedom, and bounds on the influence of neglected details. The paper outlines concrete research artifacts (TMR and GRT) and an evaluation protocol to study separation of scales and potential universality in NN behavior, positioning renormalisation as a principled path to robust, faithful explanations. This approach aims to provide scalable, theory-informed interpretability tools with practical safety implications across AI systems, while acknowledging uncertainties and the need for cross-disciplinary collaboration.

Abstract

Neural networks organize information according to the hierarchical, multi-scale structure of natural data. Methods to interpret model internals should be similarly scale-aware, explicitly tracking how features compose across resolutions and guaranteeing bounds on the influence of fine-grained structure that is discarded as irrelevant noise. We posit that the renormalisation framework from physics can meet this need by offering technical tools that can overcome limitations of current methods. Moreover, relevant work from adjacent fields has now matured to a point where scattered research threads can be synthesized into practical, theory-informed tools. To combine these threads in an AI safety context, we propose a unifying research agenda, scale-aware interpretability, to develop formal machinery and interpretability tools that have robustness and faithfulness properties supported by statistical physics.
