OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

TL;DR

OTCE addresses the challenge of modeling long-context language with efficient computation by blending a selective state space model (SSM) with self-attention, bridged by a rotational positional encoding (RoPE) scheme. The Observer-Thinker-Conceiver-Expresser architecture, coupled with cohesive and expansive cross-domain mixtures of experts, enables efficient state aggregation, global dependency capture, and cross-domain knowledge transfer. Empirical results show OTCE to be competitive with medium-scale open-source models and superior on long-context and associative-recall tasks, with notable gains from Expresser reweighting and joint use of RoPE. The approach offers a scalable framework for long-context language modeling with improved data efficiency and reduced routing bias, potentially impacting practical NLP systems that require long-context reasoning and cross-domain knowledge integration.

Abstract

Recent research has shown that combining the Mamba architecture, with its selective state space, and the Transformer architecture, with its quadratic self-attention mechanism, outperforms using either architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of the selective state space in handling long-term dependencies between arbitrary elements of the sequence. We propose a position-information injection method that connects the selective state space model with quadratic self-attention, and we integrate these two architectures with hybrid experts whose domains are cross-shared, so that we can enjoy the advantages of both. We design a new architecture following a more biomimetic idea, the Observer-Thinker-Conceiver-Expresser (OTCE), which at a small scale can compete with well-known medium-scale open-source language models in language modeling tasks.
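
As a rough orientation, the following is a minimal PyTorch-style sketch of how the four stages might be composed into a single block, assuming pre-norm residual wiring and user-supplied sub-layers (a selective SSM for the Observer, RoPE-equipped self-attention for the Thinker, an MLP for the Conceiver, and a cross-domain mixture of experts for the Expresser). The class name, interfaces, and residual placement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OTCEBlockSketch(nn.Module):
    """Illustrative composition of the four OTCE stages (assumed interfaces)."""

    def __init__(self, d_model, ssm_layer, attn_layer, mlp_layer, moe_layer):
        super().__init__()
        self.observer = ssm_layer    # selective state space model (Mamba-style)
        self.thinker = attn_layer    # quadratic self-attention with RoPE on Q/K
        self.conceiver = mlp_layer   # position-wise multi-layer perceptron
        self.expresser = moe_layer   # cross-domain mixture of experts (reweighting)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x):
        # Each stage is applied pre-norm with a residual connection
        # (an assumption, not a detail stated in the abstract).
        x = x + self.observer(self.norms[0](x))
        x = x + self.thinker(self.norms[1](x))
        x = x + self.conceiver(self.norms[2](x))
        x = x + self.expresser(self.norms[3](x))
        return x

# Smoke test with identity placeholders standing in for the real sub-layers.
block = OTCEBlockSketch(64, nn.Identity(), nn.Identity(), nn.Identity(), nn.Identity())
y = block(torch.randn(2, 16, 64))
```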

Paper Structure

This paper contains 33 sections, 51 equations, 13 figures, 9 tables, and 4 algorithms.

Figures (13)

  • Figure 1: (OTCE Architecture.) OTCE shows the overall combined architecture and the process of using the Observer, Thinker, Conceiver, and Expresser modules in language modeling tasks, with each module showing its internal combination of selective state space, self-attention, multi-layer perceptron, and cross-domain mixture of experts.
  • Figure 2: (Shared Expert Isolation.) The hidden state of the entire sequence computed by the shared expert is added to the hidden state each token receives from its most relevant routed expert, so every token's output combines the shared-expert state with a routed-expert state (a minimal sketch of this shared-plus-routed composition appears after this list).
  • Figure 3: (RoPE for SSM.) Structure diagram of applying rotational positional encoding to the selective state space model.
  • Figure 4: (Mamba's Positional Information.) By convolution, Mamba provides continuous relative positional information for the matrix $D$, which forms a skip connection between the input and output of the SSM.
  • Figure 5: (RoPE for Quadratic Self-Attention.) Applying rotational positional encoding to the quadratic self-attention mechanism reintroduces continuous positional information before the $QK$ inner product (see the RoPE sketch after this list, which covers Figures 3 and 5).
  • ...and 8 more figures
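
The shared-expert isolation of Figure 2 can be illustrated with a short sketch: every token always passes through a shared expert, a router additionally picks the single most relevant routed expert for each token, and the two hidden states are summed. Expert widths, top-1 routing, and the gating details below are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertIsolationSketch(nn.Module):
    """Hypothetical shared-plus-routed mixture-of-experts layer (Figure 2 idea)."""

    def __init__(self, d_model, n_routed_experts):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.shared_expert = make_expert()                    # always applied to every token
        self.routed_experts = nn.ModuleList([make_expert() for _ in range(n_routed_experts)])
        self.router = nn.Linear(d_model, n_routed_experts)    # scores each routed expert

    def forward(self, x):                                     # x: (batch, seq_len, d_model)
        shared_out = self.shared_expert(x)                    # shared-expert state
        gate = F.softmax(self.router(x), dim=-1)              # (batch, seq_len, n_experts)
        top_w, top_idx = gate.max(dim=-1)                     # top-1 routing per token
        routed_out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed_experts):
            mask = top_idx == e                               # tokens assigned to expert e
            if mask.any():
                routed_out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return shared_out + routed_out                        # shared state + routed state

moe = SharedExpertIsolationSketch(d_model=64, n_routed_experts=4)
out = moe(torch.randn(2, 16, 64))
```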
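
Figures 3 and 5 both revolve around the same rotation. The sketch below shows a standard rotary-position-embedding function applied to queries and keys before the $QK$ inner product; in principle the same function could also be applied on the SSM branch, which is how the joint use of RoPE is read here. The exact placement inside the selective SSM is an assumption, not taken from the paper.

```python
import torch

def rotate_half(x):
    """Map feature pairs (x1, x2) to (-x2, x1), the rotation's second component."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, base=10000.0):
    """Rotate the features of x by position along the sequence dimension.

    x: (batch, seq_len, dim) with an even dim.
    """
    _, seq_len, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, half)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)                       # (seq_len, dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin

# Usage: rotate queries and keys, then take the scaled dot product as usual (Figure 5);
# the same rotation could be applied to the SSM branch's input projection (Figure 3).
q, k = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
scores = apply_rope(q) @ apply_rope(k).transpose(-1, -2) / 64 ** 0.5
```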

Theorems & Definitions (1)

  • proof: Proof of Equation \ref{eq:ssm_rope_g_final}