Interpretable-by-Design Transformers via Architectural Stream Independence

Clayton Kerce; Alexis Fox

TL;DR

The Late Fusion Architecture (LFA) demonstrates that architectural constraints improve underlying learning mechanisms: stability ranges from 50% on LFA's best pairs down to 0% (complete collapse) in over-constrained cases. This establishes interpretability as an architectural design criterion, enforceable through structural constraints rather than post-hoc analysis.

Abstract

While transformers achieve strong performance, their internal decision-making processes remain opaque. We investigate whether architectural constraints can enforce interpretability by design through architectural stream independence: keeping a token stream (carrying symbolic structure) and a contextual-semantics stream separate and independently observable throughout processing, with integration delayed until output. We validate this principle with the Late Fusion Architecture (LFA), which retains interpretable symbolic heads through the final layers, whereas standard transformers show dissolution by the third of six layers. We quantify this effect by introducing the Token-Position Dependence Score (PDS), with $\mathrm{PDS}_{\max} = 0.276$ for LFA and $0.058$ for the standard transformer. Crucially, intervention experiments demonstrate functional modularity: suppressing LFA's recency heads causes minimal semantic damage (Cohen's $d = -0.158$) versus catastrophic entanglement in baselines ($d = -0.672$). LFA demonstrates that architectural constraints improve underlying learning mechanisms, averaging 42% stability versus 19% and 11% for the baseline comparisons, with extremes ranging from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0% (complete collapse) in over-constrained cases. By preventing premature entanglement, architectural independence steers models toward semantic understanding over positional heuristics, establishing interpretability as an architectural design criterion enforceable through structural constraints rather than post-hoc analysis.
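The intervention results above are reported as Cohen's d. As a rough illustration of how such an effect size can be computed, the sketch below uses the common pooled-standard-deviation form. Whether the paper uses this exact variant is an assumption, the function name `cohens_d` is hypothetical, and the scores are made-up toy values rather than data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(treated, control):
    """Cohen's d with a pooled standard deviation (one common variant).

    `treated` and `control` are sequences of per-example scores, e.g. a
    semantic score measured with and without head suppression. The names
    and the pooled-SD choice are illustrative assumptions.
    """
    n1, n2 = len(treated), len(control)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


# Toy values only: a negative d means the suppressed condition scores lower.
suppressed = [1.0, 2.0, 3.0]
baseline = [2.0, 3.0, 4.0]
d = cohens_d(suppressed, baseline)
```

Under this convention, a d close to zero means suppression barely shifts the score relative to its spread, while a larger negative d indicates a more substantial drop.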

Paper Structure (44 sections, 13 equations, 7 figures, 5 tables)

Introduction
Notation.
Architecture and Design Principles
Architectural Stream Independence: The Design Principle
Stream Separation Mechanism
Distinction from Prior Disentanglement
Why Asymmetric Flow Maintains Architectural Independence
Architectural Constraints
Model Configurations
Explainability Methodology: Isolating Positional from Semantic Reasoning
Coreference Resolution Analysis
Token-Position Dependence Score
Measuring Functional Transparency via Intervention
Transparent Functional Decomposition
Coreference Head Specialization
...and 29 more sections

Figures (7)

Figure 1: Temporal validation of architectural stream independence. LFA concentrates positional processing in layers 4-5, maintaining complete stream separation throughout all transformer layers. Std-T processes position in layers 0-1, causing immediate entanglement and dissolution by mid-layers. CFM's mid-layer concentration reflects failed integration: excessive constraint prevents late-layer coordination. This layer distribution directly validates the architectural hypothesis: separated streams preserve distinct symbolic channels in deep layers.
Figure 2: Functional transparency via intervention. SPS under baseline and recency head suppression (PDS $> 0.075$). LFA demonstrates functional independence ($d = -0.158$), Std-T shows moderate entanglement ($d = -0.298$), CFM reveals complete opacity ($d = -0.672$). Error bars: SEM.
Figure 3: LFA architecture showing frozen symbolic stream $X_T$ and evolving embedding stream $X_E$.
Figure 4: PDS distribution across all heads. LFA shows bimodal separation between position-dependent (PDS $> 0.075$) and position-invariant heads. Std-T and CFM show unimodal distributions centered near zero, indicating position signals dissolve or fail to form.
Figure 5: Gate response curves for top-1 recency head intervention. LFA shows shallow, linear response (functional independence). Std-T and CFM show steep, non-linear collapse (entanglement).
...and 2 more figures
