TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla

TL;DR

TamperBench provides a unified, extensible framework to systematically stress-test the tamper resistance of open-weight LLMs against a broad suite of weight-space and latent-space attacks, while evaluating both safety (via StrongREJECT) and utility (via MMLU-Pro). By conducting hyperparameter sweeps and realistic threat modeling, the framework enables robust, reproducible comparisons across models and defenses. The experiments across 21 open-weight LLMs and nine tampering threats reveal pervasive vulnerability: jailbreak-tuning often yields the strongest safety breaches with preserved capability, and even defense-augmented variants can be compromised, though some defenses like Triplet and TAR show improvements at a cost to utility. The work highlights the urgent need for standardized tamper-resistance evaluation and provides an extensible platform to advance defenses, tests, and community contributions.

Abstract

As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluate tamper resistance. Varied data sets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack-model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite while ensuring end-to-end reproducibility. We use TamperBench to evaluate 21 open-weight LLMs, including defense-augmented variants, across nine tampering threats using standardized safety and capability metrics with hyperparameter sweeps per model-attack pair. This yields novel insights, including effects of post-training on tamper resistance, that jailbreak-tuning is typically the most severe attack, and that Triplet emerges as a leading alignment-stage defense. Code is available at: https://github.com/criticalml-uw/TamperBench

Paper Structure (53 sections, 13 figures, 3 tables)

This paper contains 53 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Tampering LLMs, as defined by [che2025model], involves modifying their weights or latent representations and can compromise safety guardrails, yielding models that can output harmful responses. While numerous methods have been proposed to make models tamper-resistant, there is no systematic framework to measure this. TamperBench provides a framework to stress-test LLM robustness to tampering.
  • Figure 2: While many alignment-stage defenses have been proposed (e.g., [tar_iclr; vaccine_nips; zou2024improving; sheshadri2025latent]), they do not share a standardized evaluation, making comparisons between the approaches inconclusive. This motivates TamperBench as the first framework to consolidate tampering attacks and evaluations into a unified toolkit.
  • Figure 3: TamperBench evaluates a broad range of model tampering that may compromise safeguards, and assesses both safety and capabilities after adaptation. Tampering is taxonomized based on the model adaptor's intent: malicious or benign (accidental). Malicious attacks are further divided into direct, overt ones, and covert ones originally designed to bypass closed-weight moderation safeguards.
  • Figure 4: A single script can be run to benchmark an LLM by providing either a local checkpoint path or a HuggingFace repository ID, along with a list of attack names. The toolkit then executes the specified tampering attacks and evaluation modules, producing results scored with standardized safety and utility metrics and cached for reproducibility. TamperBench is designed to be highly extensible, enabling researchers to contribute methods with minimal code overhead.
  • Figure 5: Benchmarking tamper-resistant refusal of harmful requests across 21 open-weight LLMs. For each model-attack pair, we select the configuration from our hyperparameter sweeps that maximizes harmfulness (StrongREJECT score) while constraining utility loss to $\leq 10\%$ MMLU-Pro drop relative to the untampered baseline. Rows correspond to tampering attacks grouped by threat type. Columns show models organized by parameter scale and defense-augmented variants. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
  • ...and 8 more figures
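
The selection rule in the Figure 5 caption (pick, per model-attack pair, the sweep configuration with the highest StrongREJECT score among those that keep the MMLU-Pro drop within 10% of the untampered baseline) can be sketched as below. This is a minimal illustration and not TamperBench's actual code: the `SweepResult` record, the `select_worst_case` function, and the example numbers are hypothetical, and the sketch assumes the 10% bound is a relative drop (a fraction of the baseline score), which the caption does not state unambiguously.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SweepResult:
    """One tampered checkpoint from a sweep (hypothetical record, not TamperBench's schema)."""
    config_name: str
    strongreject: float  # harmfulness score; higher means more harmful
    mmlu_pro: float      # utility accuracy in [0, 1]


def select_worst_case(
    results: Iterable[SweepResult],
    baseline_mmlu_pro: float,
    max_drop: float = 0.10,
) -> Optional[SweepResult]:
    """Return the most harmful configuration that stays within the utility budget.

    Assumption: max_drop is interpreted relative to the baseline, so the
    lowest acceptable MMLU-Pro score is baseline * (1 - max_drop).
    """
    floor = baseline_mmlu_pro * (1.0 - max_drop)
    eligible = [r for r in results if r.mmlu_pro >= floor]
    if not eligible:
        return None  # no configuration preserved enough utility
    return max(eligible, key=lambda r: r.strongreject)


# Made-up numbers for illustration only.
sweep = [
    SweepResult("lr=1e-5", strongreject=0.42, mmlu_pro=0.50),
    SweepResult("lr=1e-4", strongreject=0.81, mmlu_pro=0.47),
    SweepResult("lr=1e-3", strongreject=0.93, mmlu_pro=0.30),  # too much utility loss
]
best = select_worst_case(sweep, baseline_mmlu_pro=0.50)
```

With these made-up numbers the third configuration is the most harmful but falls below the utility floor, so the second one is selected. Constraining utility in this way keeps the comparison focused on attacks that leave a usable model rather than ones that degrade safety by destroying capability.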