The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

J. Clayton Kerce, Alexis Fox

TL;DR

The Dual-Stream Transformer decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. This provides a foundation for interpretable language models where internal structure is exposed by design.

Abstract

Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16% to 27%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design.

This work was partially supported by DARPA Contract HR001125C0302.
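
The attention amplification mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes that amplification means multiplying the pre-softmax attention logits by a factor alpha at inference time; the function name `amplified_attention`, the single-head causal setup, and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def amplified_attention(q, k, v, alpha=1.0):
    """Causal single-head attention with logits scaled by alpha.

    Assumption: amplification multiplies the scaled dot-product logits by
    alpha before the softmax. alpha = 1 recovers ordinary attention.
    """
    seq_len, d_head = q.shape
    logits = alpha * (q @ k.T) / np.sqrt(d_head)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Hypothetical toy inputs: 5 positions, head dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))

for alpha in (1, 4, 16):
    _, w = amplified_attention(q, k, v, alpha)
    # The mean of the largest weight per row grows toward 1 as alpha
    # increases, i.e. attention approaches hard (one-hot) selection.
    print(alpha, float(w.max(axis=-1).mean()))
```

As alpha grows, each row of the attention matrix concentrates on its highest-scoring position, which is the sense in which amplification pushes soft mixing toward discrete selection.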

Paper Structure (57 sections, 13 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: Dual-Stream Transformer architecture. The residual stream is decomposed into a token stream $\mathbf{x}_t$ (orange, left) updated exclusively by attention, and a context stream $\mathbf{x}_e$ (blue, right) updated exclusively by FFN. Both streams are combined for computing queries, keys, and FFN inputs via Channel-aware Layer Normalization (CLN). Channelized mixing strategies (purple labels) control information flow between attention heads at each projection. The hierarchy from Identity to Dense provides tunable interpretability-performance tradeoffs.
  • Figure 2: Validation loss versus attention amplification factor. (a) Kronecker-Dense mixing degrades more gracefully than Independent-Dense, with 9.5% lower cumulative degradation across $\alpha \in [1,16]$. (b) Gated attention variants show similar amplification robustness to their non-gated counterparts. All configurations maintain functional generation at $\alpha=16$ with bounded degradation.
  • Figure 3: Learned Kronecker routing matrices across layers. Each heatmap shows the $6 \times 6$ head-to-head routing weights for (a) value projection and (b) output projection. Cell $(i,j)$ indicates routing from head $j$ to head $i$. Red indicates positive (excitatory) weights; blue indicates negative (inhibitory) weights. Routing strength increases in deeper layers, with maximum weights growing from $\sim$1.0 in layer 0 to $\sim$3.5 in layer 5.
  • Figure 4: Attention patterns under progressive amplification (FTS configuration, Layer 3). Distributions sharpen from soft mixing at baseline to near-deterministic selection at $\alpha=16$. The model maintains coherent predictions throughout this transition, indicating that the underlying computation can operate on discrete token selections.
  • Figure 5: Coreference resolution accuracy by head across architectures. Left: Independent-Dense configuration shows strong specialization with distinct functional roles per head. Right: Dense baseline distributes coreference computation across multiple heads with less differentiation. Architectural constraints promote interpretable specialization.
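
The head-to-head routing shown in Figure 3 can be sketched as follows. This is a hedged illustration that assumes Kronecker mixing takes the form $R \otimes I_{d_{\text{head}}}$, where $R$ is the $H \times H$ routing matrix and entry $(i,j)$ scales the contribution of head $j$ to head $i$. The function name `kronecker_mix` and the toy dimensions are hypothetical; the paper's exact parameterization may differ.

```python
import numpy as np


def kronecker_mix(x, R):
    """Mix heads with scalar routing weights, leaving within-head channels intact.

    x: (seq_len, H * d_head) activations, heads laid out contiguously.
    R: (H, H) routing matrix; R[i, j] routes head j into head i.
    Assumed form: equivalent to multiplying by kron(R, I_{d_head}).
    """
    seq_len, d_model = x.shape
    n_heads = R.shape[0]
    d_head = d_model // n_heads
    heads = x.reshape(seq_len, n_heads, d_head)
    # Output head i is a weighted sum of input heads j; channel c is untouched.
    mixed = np.einsum("ij,tjc->tic", R, heads)
    return mixed.reshape(seq_len, d_model)


# Hypothetical toy setup: 6 heads (as in Figure 3), head dimension 4.
rng = np.random.default_rng(0)
n_heads, d_head, seq_len = 6, 4, 3
x = rng.standard_normal((seq_len, n_heads * d_head))
R = rng.standard_normal((n_heads, n_heads))

# Cross-check against the explicit dense Kronecker-product matrix.
dense_equivalent = x @ np.kron(R, np.eye(d_head)).T
print(np.allclose(kronecker_mix(x, R), dense_equivalent))

# Parameter count: H*H routing scalars versus a full d_model*d_model matrix.
print(R.size, (n_heads * d_head) ** 2)
```

Under this assumed form, setting `R` to the identity matrix leaves every head independent, while a full `d_model × d_model` matrix would correspond to dense mixing, so the routing matrix sits between the two ends of the hierarchy described in Figure 1.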