When to Think Fast and Slow? AMOR: Entropy-Based Metacognitive Gate for Dynamic SSM-Attention Switching
Haoran Zheng
TL;DR
AMOR is a dual-process-inspired hybrid for sequence modeling: an SSM backbone processes every position, and sparse attention is engaged only where prediction entropy is high. Keys and values are projected from SSM hidden states into a Ghost KV cache, so attention reuses computation the SSM has already performed. On synthetic retrieval tasks, AMOR attains perfect retrieval accuracy at a 22% gate rate, that is, with attention engaged on only 22% of positions. Routing decisions are interpretable in information-theoretic terms, and the approach is more efficient than full attention, although the work also identifies limits imposed by SSM state decay and horizon length. Overall, AMOR offers a principled framework for dynamic computation allocation that mirrors human metacognitive strategies and opens avenues for persistent memory and feedback-driven enhancements.
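The entropy-based gate can be illustrated with a minimal sketch. It assumes the gate is a plain threshold on the Shannon entropy (in nats) of the backbone's next-token distribution. The function names, the toy logits, and the 0.5-nat threshold are hypothetical choices made for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gate(logits_per_position, threshold):
    """Return one bool per position: True means 'engage attention here'."""
    return [entropy_nats(softmax(l)) > threshold for l in logits_per_position]

# Toy logits over a 4-token vocabulary (assumed values, for illustration only).
logits = [
    [8.0, 0.0, 0.0, 0.0],  # confident prediction: entropy near 0
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction: entropy = ln 4
    [5.0, 5.0, 0.0, 0.0],  # near two-way tie: entropy slightly above ln 2
]

decisions = gate(logits, threshold=0.5)        # hypothetical threshold
gate_rate = sum(decisions) / len(decisions)    # fraction of gated positions
print(decisions, gate_rate)
```

In this toy run only the two uncertain positions are routed to attention, giving a gate rate of 2/3. The reported 22% gate rate is the same quantity measured on the paper's synthetic tasks.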
Abstract
Transformers allocate uniform computation to every position, regardless of difficulty. State Space Models (SSMs) offer an efficient alternative but struggle with precise information retrieval over long horizons. Inspired by dual-process theories of cognition (Kahneman, 2011), we propose AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically engages sparse attention only when an SSM backbone is "uncertain", as measured by prediction entropy. Compared to standard transformers, AMOR gains efficiency by projecting keys and values from SSM hidden states (Ghost KV), reusing the SSM's O(n) computation rather than requiring O(n^2) attention at every layer. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, achieving perfect retrieval accuracy while engaging attention on only 22% of positions. We validate that prediction entropy reliably signals retrieval need: entropy at retrieval positions and at local positions differs by 1.09 nats (nearly half the entropy range). Additionally, our approach provides interpretable adaptive computation, in which routing decisions can be understood in information-theoretic terms.
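The Ghost KV idea can be sketched as follows, under several assumptions that the abstract does not confirm: keys, values, and queries are single linear projections of the hidden states, attention is causal and single-headed with the usual 1/sqrt(d) scaling, and the gate decisions are supplied as input. All names (`matvec`, `ghost_kv_attention`, `W_q`, `W_k`, `W_v`) are hypothetical, and how the attention output is merged back into the backbone is left out.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ghost_kv_attention(hidden, gated, W_q, W_k, W_v):
    """Causal attention at gated positions only.

    hidden: per-position hidden-state vectors (stand-in for SSM outputs)
    gated:  one bool per position, True where attention should run
    Returns {position: context vector} for the gated positions.
    """
    # Keys and values are projections of hidden states that already exist,
    # so building them costs one linear map per position.
    keys = [matvec(W_k, h) for h in hidden]
    values = [matvec(W_v, h) for h in hidden]
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = {}
    for t, use_attention in enumerate(gated):
        if not use_attention:
            continue  # ungated positions incur no attention cost
        q = matvec(W_q, hidden[t])
        # Attend over positions 0..t only (causal mask, assumed).
        scores = [scale * sum(a * b for a, b in zip(q, keys[s]))
                  for s in range(t + 1)]
        weights = softmax(scores)
        out[t] = [sum(w * values[s][j] for s, w in enumerate(weights))
                  for j in range(dim)]
    return out

# Toy usage with identity projections (assumed, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
context = ghost_kv_attention(hidden, [False, False, True],
                             identity, identity, identity)
print(context)
```

In this sketch the number of query positions equals the number of gated positions, so the attention work scales with the gate rate rather than with every position in the sequence.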
