Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

David Gringras

TL;DR

We report one of the largest controlled studies of scaffold effects on safety, combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis.

Abstract

Safety benchmarks evaluate language models in isolation, typically using multiple-choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis. Map-reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map-reduce degradation revealed a deeper measurement problem: switching from multiple-choice to open-ended format on identical items shifts safety scores by 5-20 percentage points, larger than any scaffold effect. Within-format scaffold comparisons are consistent with practical equivalence under our pre-registered ±2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model × scaffold interactions span 35 pp in opposing directions (one model degrades by 16.8 pp on sycophancy under map-reduce while another improves by 18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non-zero reliability, making per-model, per-configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
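
The abstract leans on two statistics that are easy to misread: the number needed to harm (NNH) and the TOST equivalence margin. The sketch below illustrates both under stated assumptions. It takes NNH in its usual sense (the reciprocal of the absolute increase in the unsafe-response rate) and runs TOST as two one-sided z-tests on a difference of proportions with a normal approximation. The function names and all counts are hypothetical and are not taken from the ScaffoldSafety release; the paper's own analysis may use a different estimator.

```python
import math


def number_needed_to_harm(unsafe_rate_baseline: float, unsafe_rate_scaffold: float) -> float:
    """NNH = 1 / absolute increase in the unsafe-response rate (hypothetical helper)."""
    risk_increase = unsafe_rate_scaffold - unsafe_rate_baseline
    if risk_increase <= 0:
        return math.inf  # no measured harm, so NNH is unbounded
    return 1.0 / risk_increase


def _norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(x1: int, n1: int, x2: int, n2: int, margin: float = 0.02):
    """Two one-sided z-tests for equivalence of two proportions.

    Returns (difference, p_value). A small p_value supports the claim that
    the true difference lies inside (-margin, +margin).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; normal approximation is unusable")
    # H0a: diff <= -margin, rejected when diff sits well above the lower bound.
    p_lower = 1.0 - _norm_cdf((diff + margin) / se)
    # H0b: diff >= +margin, rejected when diff sits well below the upper bound.
    p_upper = _norm_cdf((diff - margin) / se)
    # Equivalence requires rejecting both, so report the larger p-value.
    return diff, max(p_lower, p_upper)


if __name__ == "__main__":
    # Hypothetical rates chosen so the risk increase is exactly 1/14.
    print(number_needed_to_harm(0.10, 0.10 + 1 / 14))
    # Hypothetical counts: 90.0% vs 89.9% safe out of 5,000 items each.
    print(tost_two_proportions(4500, 5000, 4495, 5000, margin=0.02))
```

Read this way, NNH = 14 means roughly one additional unsafe response for every 14 items routed through the scaffold. A TOST result "consistent with equivalence" means both one-sided nulls are rejected at the chosen margin, which is a stronger claim than failing to find a difference.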

Paper Structure (214 sections, 1 equation, 12 figures, 42 tables)

Figures (12)

  • Figure 1: Aggregate safety rates by model (rows) and deployment configuration (columns). Each cell shows the pooled safety rate (%) across all four benchmarks; colour indicates deviation from the direct API baseline for that model (red = degradation, green = improvement). Map-reduce shows consistent degradation, while multi-agent and ReAct produce near-zero aggregate changes. Benchmark-specific breakdowns are in Figure \ref{fig:sensitivity}. $N = 62{,}808$ scored observations across six models.
  • Figure 2: Safety rates by benchmark, model, and configuration. Stars indicate significant pairwise differences vs. direct baseline (BH-FDR $q < 0.05$). Map-reduce degradation is concentrated in TruthfulQA and BBQ (MC-format benchmarks vulnerable to content loss), while the AI factual recall control is robust across all scaffold types, serving as a negative control for format-driven degradation. $N = 62{,}808$ scored observations across six models.
  • Figure 3: Safety outcomes as a function of task-structure preservation across scaffold conditions. Each point represents one model under one deployment condition. Conditions are ordered by the approximate fraction of original task structure (MC options, context, system prompt) preserved in model sub-calls, estimated via propagation tracing on an instrumented subset (Section \ref{sec:propagation}). Because x-axis values are condition-level summaries rather than per-observation measurements, this figure is descriptive; the construct validity evidence for structure preservation as a mechanism comes from the option-preserving experiment (Section \ref{sec:option_preserving}), which directly manipulates structure retention. CoT and direct both preserve 100% of task structure; differences between them reflect reasoning-elicitation effects (Section \ref{sec:cot}).
  • Figure 4: Information bottleneck in map-reduce scaffolding. The decompose step strips MC answer options (retained in 0--4% of map-worker sub-calls) while preserving safety instructions in processing sub-calls (100% of map/reduce steps; 2% of the decompose routing step), explaining why map-reduce produces confidently wrong answers (89.8% of errors) rather than novel safety violations. Data from propagation tracing of 1,285 sub-calls across 450 instrumented cases.
  • Figure 5: Phase 2 confirmatory dose-response across six models ($N = 300$ per condition). Left: BBQ accuracy declines monotonically with bias-invocation intensity for all models except Opus. Right: TruthfulQA accuracy improves monotonically with misconception-invocation intensity. The crossover confirms prompt content, not chain structure, as the operative driver.
  • ...and 7 more figures
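
The Figure 2 caption marks significance using BH-FDR $q < 0.05$. As a minimal sketch of what that adjustment does, the function below computes Benjamini-Hochberg adjusted p-values (q-values) in plain Python. The function name and the example p-values are hypothetical and are not drawn from the paper's data or code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value can be capped by
    # the q-value of the next larger p-value (enforces monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical raw p-values for five pairwise comparisons vs. baseline.
    raw = [0.001, 0.008, 0.039, 0.041, 0.6]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        flag = "*" if q < 0.05 else " "
        print(f"p = {p:.3f}  q = {q:.5f} {flag}")
```

In this hypothetical set, four raw p-values fall below 0.05 but only two comparisons keep a star after adjustment, which is the practical effect of controlling the false discovery rate across many model-by-configuration comparisons.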