When to Think Fast and Slow? AMOR: Entropy-Based Metacognitive Gate for Dynamic SSM-Attention Switching

Haoran Zheng

TL;DR

AMOR is a dual-process-inspired hybrid for sequence modeling: an SSM backbone processes every position, and sparse attention is engaged only where prediction entropy is high. Keys and values are projected from SSM hidden states into a Ghost KV cache, so retrieval is available at a fraction of the attention computation; on synthetic retrieval tasks, AMOR attains perfect retrieval accuracy while the gate fires on only 22% of positions. Routing decisions are interpretable in information-theoretic terms, and the results also expose a limit: retrieval degrades once the SSM state decays past its retention horizon. AMOR thus offers a principled framework for dynamic computation allocation that mirrors human metacognitive strategies and points toward persistent-memory and feedback-driven extensions.

Abstract

Transformers allocate uniform computation to every position, regardless of difficulty. State Space Models (SSMs) offer efficient alternatives but struggle with precise information retrieval over a long horizon. Inspired by dual-process theories of cognition (Kahneman, 2011), we propose AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically engages sparse attention only when an SSM backbone is "uncertain," as measured by prediction entropy. Compared to standard transformers, AMOR gains efficiency by projecting keys and values from SSM hidden states (Ghost KV), reusing the SSM's $O(n)$ computation rather than requiring $O(n^2)$ attention at every layer. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, achieving perfect retrieval accuracy while engaging attention on only 22% of positions. We validate that prediction entropy reliably signals retrieval need, with a gap of 1.09 nats (nearly half the entropy range) between retrieval and local positions. Additionally, our approach provides interpretable adaptive computation, where routing decisions can be understood in information-theoretic terms.
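
To make the routing signal concrete, the following is a minimal, dependency-free sketch of an entropy gate: it computes the Shannon entropy (in nats) of the softmax distribution at each position and flags positions whose entropy exceeds a threshold. The function names, the toy logits, and the threshold value of 0.7 are illustrative assumptions, not details taken from the paper, and the paper's actual gate may be parameterized differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (nats) of the predictive distribution softmax(logits)."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gate(logits_per_position, threshold):
    """Return one boolean per position; True means 'engage attention here'.

    The fixed threshold is a hypothetical choice made for illustration.
    """
    return [prediction_entropy(l) > threshold for l in logits_per_position]


# Toy 4-token vocabulary: one sharply peaked prediction, one uniform prediction.
confident = [8.0, 0.0, 0.0, 0.0]   # low entropy: the backbone "knows"
uncertain = [0.0, 0.0, 0.0, 0.0]   # maximum entropy: ln(4) nats

gates = entropy_gate([confident, uncertain, confident], threshold=0.7)
gate_rate = sum(gates) / len(gates)  # fraction of positions routed to attention
```

Here only the uniform-logit position crosses the threshold, so the gate fires on one of the three positions. The gate rate computed this way is the same kind of quantity as the 22% figure reported above, although the toy numbers are not meant to reproduce it.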

Paper Structure (59 sections, 8 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: AMOR architecture. An SSM (System 1) processes each token, producing predictions and hidden states. The entropy gate monitors prediction entropy: high entropy ("I don't know") triggers sparse attention (System 2) over the Ghost KV cache, whose keys and values are projected from SSM hidden states and therefore provide temporally aware representations for retrieval.
  • Figure 2: Entropy distribution at local vs retrieval positions. The clear bimodal separation validates entropy as a routing signal.
  • Figure 3: Entropy gap evolution during training. The model learns to be uncertain at retrieval positions while remaining confident at local positions.
  • Figure 4: SSM state decay vs noise length. Retrieval accuracy drops sharply as noise length increases beyond the SSM's state retention horizon ($\sim$50 tokens).
  • Figure 5: Gate firing pattern on an example sequence. Red indicates gate fires (attention engaged), green indicates gate skips (SSM-only processing).
  • ...and 1 more figure
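
The data flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head, scaled dot-product scores, causal masking, and a residual combination of the SSM hidden state with the attention output, none of which are confirmed by the text above. The names `ghost_kv_attention`, `W_q`, `W_k`, and `W_v` are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ghost_kv_attention(hidden, gates, W_q, W_k, W_v):
    """Apply attention only at gated positions.

    hidden: per-position SSM hidden states (lists of floats).
    gates:  per-position booleans from the entropy gate.
    Keys and values are projected from the SSM hidden states themselves;
    that projected list plays the role of the Ghost KV cache.
    """
    keys = [matvec(W_k, h) for h in hidden]
    values = [matvec(W_v, h) for h in hidden]
    d = len(keys[0])
    outputs = []
    for t, (h, fire) in enumerate(zip(hidden, gates)):
        if not fire:
            outputs.append(list(h))  # gate skips: SSM-only path
            continue
        q = matvec(W_q, h)
        # Causal attention: position t sees cache entries 0..t only.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[: t + 1]]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, values[: t + 1]))
                   for j in range(d)]
        # Residual combination (an assumption made for this sketch).
        outputs.append([hj + cj for hj, cj in zip(h, context)])
    return outputs


# Toy run: 2-dimensional hidden states, identity projections, gate fires at t=0 and t=3.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
gates = [True, False, False, True]
outputs = ghost_kv_attention(hidden, gates, identity, identity, identity)
```

In this sketch the projections for the cache are computed at every position, which costs time linear in sequence length, while the quadratic-cost score computation runs only where the gate fires. Positions where the gate skips pass the SSM hidden state through unchanged.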