NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

Jaden Fiotto-Kaufman; Alexander R. Loftus; Eric Todd; Jannik Brinkmann; Koyena Pal; Dmitrii Troitskii; Michael Ripa; Adam Belfki; Can Rager; Caden Juang; Aaron Mueller; Samuel Marks; Arnab Sen Sharma; Francesca Lucchetti; Nikhil Prakash; Carla Brodley; Arjun Guha; Jonathan Bell; Byron C. Wallace; David Bau

TL;DR

The paper addresses barriers to studying the internals of very large open-weight transformers by introducing an intervention-graph framework that decouples experimental design from model runtime. NNsight provides deferred, trace-enabled PyTorch integration, while NDIF offers a scalable, multi-user inference service to execute interventions on preloaded, sharded models. Through a survey of interpretability literature and a suite of performance benchmarks, the authors demonstrate that this architecture enables robust, reproducible large-scale experiments with reduced startup and data-transfer costs compared to HPC or peer-to-peer approaches. The work outlines practical benefits for transparency-focused AI research and discusses limitations around closed-model access and potential misuse, while inviting broader adoption by the research and industrial communities.

Abstract

We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the Intervention Graph, an architecture developed to decouple experimental design from model runtime. Together, they form a framework that provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. Code, documentation, and tutorials are available at https://nnsight.net/.
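
To make the idea of decoupling experimental design from model runtime concrete, here is a minimal, illustrative sketch of deferred execution in plain Python. It is not the NNsight API: the names `Trace`, `Node`, and `LAYERS`, the toy three-layer "model", and the two supported operations (`set` and `save`) are all assumptions made for illustration. The sketch first records interventions as graph nodes without running anything, then interleaves them with the model's forward pass.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical toy "model": an ordered list of named layers (plain functions).
LAYERS = [
    ("embed", lambda x: [float(v) for v in x]),
    ("mlp", lambda h: [2.0 * v for v in h]),
    ("head", lambda h: sum(h)),
]


@dataclass
class Node:
    """One deferred operation attached to the output of a named layer."""
    layer: str
    op: str            # "set" overwrites one element, "save" records the value
    index: int = 0
    value: float = 0.0


@dataclass
class Trace:
    """Collects intervention nodes first, executes them only in run()."""
    nodes: List[Node] = field(default_factory=list)
    saved: Dict[str, Any] = field(default_factory=dict)

    def set(self, layer: str, index: int, value: float) -> None:
        self.nodes.append(Node(layer, "set", index, value))

    def save(self, layer: str) -> None:
        self.nodes.append(Node(layer, "save"))

    def run(self, x: Any) -> Any:
        # Interleave the recorded nodes with the model's own computation.
        for name, fn in LAYERS:
            x = fn(x)
            for node in self.nodes:
                if node.layer != name:
                    continue
                if node.op == "set":
                    x = list(x)
                    x[node.index] = node.value
                elif node.op == "save":
                    # Copy lists so later edits do not alias the saved value.
                    self.saved[name] = list(x) if isinstance(x, list) else x
        return x


# Experimental design: nothing is executed while the graph is being built.
trace = Trace()
trace.set("mlp", 0, 10.0)   # overwrite the first "neuron" of the mlp output
trace.save("mlp")
trace.save("head")
saved_before_run = dict(trace.saved)   # still empty at this point

# Model runtime: the interventions are applied during the forward pass.
output = trace.run([1, 2, 3])
```

In this sketch, values requested with `save` only become available after `run` completes, which loosely mirrors how values marked with `.save()` are returned to the user once the intervention graph has executed.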

Paper Structure (33 sections, 10 figures, 4 tables)

Introduction
Surveying model availability and research usage
A Framework for Experiments on Large-Scale AI
Representing experiments as graphs
Computation Graph.
Intervention Graph.
Familiar and expressive interventions
Co-Tenancy and remote infrastructure
Performance and Evaluation
Related Work
Discussion
Ethics Statement
Reproducibility Statement
Acknowledgements
Research Survey Details
...and 18 more sections

Figures (10)

Figure 1: An example of the implementation of an NNsight intervention graph: (a) A user writes research code from which (b) an intervention graph is constructed. (c) The intervention operations are interleaved with the original model's computation and then executed. Values marked with .save() are made available to the user upon completion.
Figure 2: Most interpretability research is done on models that lag far behind the capabilities available in either closed- or open-weight models. Each blue point represents the MMLU performance of the largest open-weight model studied by a surveyed paper, where the size of the point represents the model's parameter count. MMLU scores for models without a recorded score were interpolated from nearest neighbors. There is a significant gap between the performance of models studied (blue line) and the capabilities of leading open-weight models (shown in orange). This gap widens further when considering the performance of leading closed-weight models (black line). (a) A small group of papers study language models with $\geq$70% MMLU performance, which can account for at least some of the gap. However, many researchers are still studying smaller, less performant models that hover around baseline performance (shown in gray). (b) Studying smaller, but still capable models such as Qwen 72B or Yi-34B may be part of the solution to closing this research gap, but these models still underperform the leading open-weight model, Llama 3.1 405B.
Figure 3: Experiment code expressed using (a) standard PyTorch hooks and (b) the NNsight API. Both code snippets define the same intervention: activating three neurons that cause the model to invert the meaning of its output (e.g., producing "lie" rather than "truth"). The PyTorch intervention code, which takes six lines (a, lines 8-13), can be expressed in three lines with NNsight (b, lines 7-9). Standard PyTorch requires creating custom hooks for each access point, whereas with NNsight, all module inputs and outputs can be accessed within a single trace context.
Figure 4: An overview of the NNsight and NDIF remote system. Researchers write experiment code using the NNsight API, which is converted to an Intervention Graph. The graph is serialized to a custom JSON format and sent as a request to the NDIF frontend server. The NDIF backend can host multiple model instances, each on a dedicated set of GPU nodes. For very large models, such as Llama 3.1 405B (Dubey et al., 2024), model weights are distributed across many shards using tensor parallelism. The router transfers the request to the head node (shard 0) of the requested model via the Ray GCS Service (Moritz et al., 2018; Ray Team, 2022). Shard 0 sends the request to all other shards of the model, where it is then deserialized and executed. Each shard receives the full intervention graph, but only manages a slice of the model parameters. The Torch NCCL Head manages distributed model execution across allocated shards. After the intervention graph has been executed, results are gathered at shard 0 and sent to the object store in the NDIF frontend. The shard 0 WebSocket client informs the WebSocket client on the local workstation about the completion of the intervention. As soon as the local WebSocket client notes that the intervention is complete, it pulls the final results from the Object Store and inserts them back into the local intervention graph. The research code can then pull any results from the intervention graph that were requested via .save().
Figure 5: Schematic of research community use of NDIF vs. HPC and Petals. Green nodes show custom experiments. NDIF (left) allows many researchers to share a common inference service that runs customized experiments with shared memory. In HPC (center), researchers are responsible for weight-loading and handling model memory overhead on their own separate instances. In peer-to-peer swarm approaches like Petals (right), while GPU resources are shared, hidden states must be transferred between nodes during inference and returned to the user for custom interventions, resulting in costly data transfers.
...and 5 more figures
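
The request flow described in the Figure 4 caption (serialize the graph to JSON, hand the full graph to every shard, let each shard work on its own slice of the parameters, then gather results at shard 0) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the JSON layout, the operation names `scale_input` and `save`, the function names, and the toy dot-product "model" are all hypothetical and do not reflect NDIF's actual wire format or implementation.

```python
import json

# Hypothetical toy model: y = sum(w_i * x_i), with weights split over 2 shards.
WEIGHTS = [1.0, 2.0, 3.0, 4.0]
SHARDS = [slice(0, 2), slice(2, 4)]   # each shard owns one slice of WEIGHTS


def serialize(model_name, nodes):
    """Encode an intervention graph as a JSON request (made-up format)."""
    return json.dumps({"model": model_name, "nodes": nodes})


def run_shard(payload, shard, inputs):
    """Each shard deserializes the FULL graph but only touches its own slice."""
    request = json.loads(payload)
    x = inputs[shard]
    for node in request["nodes"]:
        if node["op"] == "scale_input":
            x = [node["factor"] * v for v in x]
    # Partial dot product over this shard's slice of the weights.
    return sum(w * v for w, v in zip(WEIGHTS[shard], x))


def run_request(payload, inputs):
    """Run every shard, gather partial results, return only saved values."""
    partials = [run_shard(payload, shard, inputs) for shard in SHARDS]
    total = sum(partials)   # stands in for the gather step at "shard 0"
    request = json.loads(payload)
    wants_output = any(
        node["op"] == "save" and node.get("target") == "output"
        for node in request["nodes"]
    )
    return {"output": total} if wants_output else {}


payload = serialize(
    "toy-model",
    [
        {"op": "scale_input", "factor": 2.0},
        {"op": "save", "target": "output"},
    ],
)
result = run_request(payload, [1.0, 1.0, 1.0, 1.0])
```

Because the graph travels as data rather than as code bound to a local model object, the same request could in principle be replayed against any host that holds the weights, which is the property the caption relies on.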
