pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts

TL;DR

pyvene addresses the need for a unified, extensible library to perform interventions on PyTorch models, enabling manipulation of internal activations across single or multiple forward passes and decoding steps. The approach centers on an intervention-oriented architecture with serializable configurations and an IntervenableModel decorator to apply, share, and reproduce complex interventions. The paper demonstrates two case studies—reproducing factual association localization in GPT2-XL and conducting intervention training with Pythia-6.9B for gender localization—highlighting both trainable interventions and probe-based baselines. Overall, pyvene offers a practical, shareable toolkit for interpretability and robustness research, fostering reproducibility and broader adoption through tutorials and a model hub.

Abstract

Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened-upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through the Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/stanfordnlp/pyvene.
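
To make the notion of an intervention on model-internal states concrete, here is a minimal sketch of an interchange intervention (the basic operation behind causal abstraction analyses), written with plain PyTorch forward hooks. It does not use pyvene's API; the toy model, the choice of intervention site, and the helper name run_with_interchange are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; the layer sizes are arbitrary (assumption).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target = model[1]  # intervention site: the hidden activation after the ReLU


def run_with_interchange(model, target, base, source):
    """Run `base`, but replace `target`'s output with its value on `source`.

    Hypothetical helper for illustration; not part of pyvene.
    """
    cache = {}

    def save_hook(module, inputs, output):
        # First pass: record the activation produced by the source input.
        cache["source"] = output.detach()

    def swap_hook(module, inputs, output):
        # Second pass: returning a value from a forward hook
        # overrides the module's output.
        return cache["source"]

    handle = target.register_forward_hook(save_hook)
    with torch.no_grad():
        model(source)
    handle.remove()

    handle = target.register_forward_hook(swap_hook)
    with torch.no_grad():
        out = model(base)
    handle.remove()
    return out


base = torch.randn(1, 4)
source = torch.randn(1, 4)

intervened = run_with_interchange(model, target, base, source)
with torch.no_grad():
    source_out = model(source)
    base_out = model(base)
```

Because this sketch swaps the entire hidden activation, the intervened output matches the source output exactly. Real analyses typically swap only a subset of neurons, positions, or a learned subspace, which is where a dedicated library becomes useful.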

Paper Structure (15 sections, 3 figures)

Figures (3)

  • Figure 1: An inference-time intervention (Li et al., 2023) on TinyStories-33M. The model is prompted with "Once upon a time there was a" and asked to complete the story. We add a static word embedding (for "happy" or "sad") to the MLP output of every layer at each decoding step, with a coefficient of 0.3. pyvene's complete implementation is provided. The original and intervened generations use greedy decoding.
  • Figure 2: We reproduce the results in Figure 1 of Meng et al. (2022), locating early sites and late sites of factual associations in GPT2-XL, in about 20 lines of pyvene code. The causal impact on output probability is mapped for the effect of each Transformer block output (left), MLP activations (middle), and attention layer output (right).
  • Figure 3: Results of interchange intervention accuracy (IIA) with the trainable intervention (DAS) and accuracy with the trainable linear probe on different model components when localizing gender information.
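
The addition intervention described in Figure 1 can be sketched with plain PyTorch forward hooks. This is not pyvene's implementation or API: the ToyBlock module, the random steering vector (a stand-in for the "happy" or "sad" word embedding), and the dimensions are assumptions for illustration. Only the coefficient 0.3 and the choice of site (the MLP output of every layer) come from the caption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyBlock(nn.Module):
    """Hypothetical residual block with an MLP sub-module (illustration only)."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, x):
        return x + self.mlp(x)  # residual connection around the MLP


dim = 8
model = nn.Sequential(*[ToyBlock(dim) for _ in range(3)])

steer = torch.randn(dim)  # assumption: stands in for a static word embedding
coeff = 0.3               # coefficient reported in the Figure 1 caption


def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so every MLP output is shifted by coeff * steer.
    return output + coeff * steer


# Attach the same static intervention to the MLP output of every layer.
handles = [block.mlp.register_forward_hook(add_steering) for block in model]

x = torch.randn(1, dim)
with torch.no_grad():
    steered = model(x)

# Remove the hooks to restore the original model.
for handle in handles:
    handle.remove()

with torch.no_grad():
    original = model(x)
```

Because the hook fires on every forward pass, the same shift would be applied at each decoding step if the hooked model were used for autoregressive generation.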