
Engineering Verifiable Modularity in Transformers via Per-Layer Supervision

J. Clayton Kerce

Abstract

Transformers resist surgical control. Ablating an attention head identified as critical for capitalization produces minimal behavioral change because distributed redundancy compensates for damage. This Hydra effect renders interpretability illusory: we may identify components through correlation, but cannot predict or control their causal role. We demonstrate that architectural interventions can expose hidden modularity. Our approach combines dual-stream processing separating token and contextual representations, per-layer supervision providing independent gradient signal at each depth, and gated attention regularizing toward discrete activation patterns. When trained with per-layer supervision, models produce ablation effects 5 to 23 times larger than architecturally identical controls trained with standard objectives. This enables 4 times greater control leverage on targeted behaviors: scaling identified attention heads produces smooth, predictable changes in model output. The key finding is architectural. Without per-layer supervision, ablation damage concentrates near zero with low variance (Winograd standard deviation 0.63%). With per-layer supervision, effects spread widely (standard deviation 6.32%), revealing which predictions depend on which circuits. The larger variance is not measurement noise but the signature of unmasked modularity. We validate our approach through three components: engineered features that capture computational dynamics rather than vocabulary structure (validated by near-zero correlation with raw activation clustering), an architecture providing positive control for modularity, and causal experiments demonstrating functional reorganization where different tasks route through different attention heads. This establishes a methodology for transforming interpretability from passive observation to active control.
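To make "per-layer supervision providing independent gradient signal at each depth" concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that every layer's hidden state is decoded through one shared unembedding matrix and that the per-layer cross-entropy losses are combined with equal weights; the abstract does not specify either choice. The names `per_layer_loss`, `standard_loss`, `unembed`, and `layer_weights` are hypothetical and introduced only for illustration.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy. logits: (T, V), targets: (T,) int ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def per_layer_loss(hidden_states, unembed, targets, layer_weights=None):
    """Weighted sum of next-token losses computed at every layer.

    hidden_states: list of (T, D) arrays, one per layer (assumed layout).
    unembed:       (D, V) matrix shared across layers (an assumption).
    layer_weights: optional (num_layers,) weights; defaults to uniform.
    Returns (total_loss, per_layer_losses).
    """
    losses = np.array([cross_entropy(h @ unembed, targets) for h in hidden_states])
    if layer_weights is None:
        layer_weights = np.full(len(losses), 1.0 / len(losses))
    return float(losses @ layer_weights), losses

def standard_loss(hidden_states, unembed, targets):
    """Control objective: supervise only the final layer's prediction."""
    return cross_entropy(hidden_states[-1] @ unembed, targets)
```

Under these assumptions, the standard objective is the special case in which all weight is placed on the last layer, so the two training regimes differ only in where the loss is attached, not in the architecture.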

Paper Structure (108 sections, 37 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Raw activation clusters capture vocabulary, not computation. Clustering layer-5 activations (768D) produces partitions dominated by token type. Our engineered features discover orthogonal structure validated through causal experiments (Section \ref{sec:causal}).
  • Figure 2: Depth distribution comparison. PLS (left) shows bimodal structure: 26% of predictions converge at layers 0--1, with a second peak at maximum depth. C2 (right) concentrates in middle-to-late layers.
  • Figure 3: Capitalization steering via attention scaling. PLS (blue) shows 4$\times$ greater control range than C2 (orange). Scaling entity head attention from 0 to 1.5$\times$ produces smooth, monotonic changes in capitalization probability.
  • Figure 4: The modularity signature. Task-specific routing in PLS versus entangled routing in C2. Variance difference reveals exposed versus hidden circuitry.
  • Figure 5: UMAP projection of sorted top-5 features. 555 clusters identified by HDBSCAN (colored), with unassigned points in gray (50,453 tokens). Clusters exhibit clear spatial separation corresponding to distinct computational modes.
  • ...and 1 more figure