Interpretable-by-Design Transformers via Architectural Stream Independence

Clayton Kerce, Alexis Fox

TL;DR

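The Late Fusion Architecture (LFA) keeps symbolic token structure and contextual semantics in separate streams until output, and shows that such architectural constraints improve the underlying learning mechanisms. Stability ranges from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0%, a complete collapse, in over-constrained cases. This establishes interpretability as an architectural design criterion, enforceable through structural constraints rather than post-hoc analysis.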

Abstract

While transformers achieve strong performance, their internal decision-making processes remain opaque. We investigate whether architectural constraints can enforce interpretability by design through architectural stream independence: keeping a token stream (carrying symbolic structure) and contextual semantics in separate streams that remain independently observable throughout processing, with integration delayed until output. We validate this principle through the Late Fusion Architecture (LFA), which retains interpretable symbolic heads through the final layers, while standard transformers show dissolution by the third of six layers; we quantify this effect by introducing the Token-Position Dependence Score (PDS), with $\mathrm{PDS}_{\max} = 0.276$ and $0.058$, respectively. Crucially, intervention experiments demonstrate functional modularity: suppressing LFA's recency heads causes minimal semantic damage (Cohen's $d = -0.158$) versus catastrophic entanglement in baselines ($d = -0.672$). LFA demonstrates that architectural constraints improve underlying learning mechanisms, averaging 42% stability versus 19% and 11% for baseline comparisons, with extremes from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0%, a complete collapse, in over-constrained cases. By preventing premature entanglement, architectural independence steers models toward semantic understanding over positional heuristics, establishing interpretability as an architectural design criterion enforceable through structural constraints rather than post-hoc analysis.

Paper Structure

This paper contains 44 sections, 13 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Temporal validation of architectural stream independence. LFA concentrates positional processing in layers 4-5, maintaining complete stream separation throughout all transformer layers. Std-T processes position in layers 0-1, causing immediate entanglement and dissolution by mid-layers. CFM's mid-layer concentration reflects failed integration: excessive constraint prevents late-layer coordination. This layer distribution directly validates the architectural hypothesis: separated streams preserve distinct symbolic channels in deep layers.
  • Figure 2: Functional transparency via intervention. SPS under baseline and recency head suppression (PDS $> 0.075$). LFA demonstrates functional independence ($d = -0.158$), Std-T shows moderate entanglement ($d = -0.298$), CFM reveals complete opacity ($d = -0.672$). Error bars: SEM.
  • Figure 3: LFA architecture showing frozen symbolic stream $X_T$ and evolving embedding stream $X_E$.
  • Figure 4: PDS distribution across all heads. LFA shows bimodal separation between position-dependent (PDS $> 0.075$) and position-invariant heads. Std-T and CFM show unimodal distributions centered near zero, indicating position signals dissolve or fail to form.
  • Figure 5: Gate response curves for top-1 recency head intervention. LFA shows shallow, linear response (functional independence). Std-T and CFM show steep, non-linear collapse (entanglement).
  • ...and 2 more figures
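
As a reading aid for the effect sizes and error bars quoted in the Figure 2 caption, the sketch below computes Cohen's d with a pooled standard deviation and the standard error of the mean (SEM). It is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `cohens_d` and `sem`, the pooled-variance variant of d, the sign convention (suppressed minus baseline, so a negative d means suppression lowered the score), and the toy numbers are all hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(suppressed, baseline):
    """Cohen's d using the pooled sample standard deviation.

    Assumed sign convention: (suppressed - baseline), so a negative value
    means the scores dropped under head suppression.
    """
    n_s, n_b = len(suppressed), len(baseline)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_s - 1) * variance(suppressed) + (n_b - 1) * variance(baseline)
    ) / (n_s + n_b - 2)
    return (mean(suppressed) - mean(baseline)) / math.sqrt(pooled_var)


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return math.sqrt(variance(values) / len(values))


# Toy scores, invented purely for illustration (not data from the paper).
baseline_scores = [2.0, 3.0, 4.0]
suppressed_scores = [1.0, 2.0, 3.0]

print(cohens_d(suppressed_scores, baseline_scores))  # -1.0
print(sem(baseline_scores))
```

Under this convention, a small magnitude such as the $d = -0.158$ reported for LFA indicates that the suppressed and baseline score distributions overlap almost entirely, while a larger magnitude such as $d = -0.672$ indicates a shift of roughly two thirds of a pooled standard deviation.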