Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner

Liu Xiao, Li Zhiyuan, Lin Yueyu

TL;DR

This work addresses two limitations of RWKV-7: the lack of token–parameter interactions and the lack of native scaling within a state-based framework. It introduces the Meta-State layer, which replaces the FFN and uses a Self-State Encoder that reuses part of the RWKV-7 WKV state as transformation weights to encode input–state and token–parameter interactions. The state is updated autoregressively, so token processing remains autoregressive. The architecture supports progressive model scaling by expanding the WKV state while reusing Meta-State parameters, avoiding full retraining. On the Pile test set, Meta-State models achieve lower cross-entropy loss than Transformer baselines at every size tested (150M, 450M, 900M, and 1.5B parameters), with relative gains increasing with scale.

Abstract

State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the \(\text{TC}^0\) complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a Self-State Encoder (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.
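
The abstract describes the SSE only at a high level, so the following is a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a square state matrix, takes the first `n_rows` rows of that state as the "portion" used as weights, and applies them linearly (no softmax, no extra trainable matrix). The function names (`self_state_encode`, `update_state`, `expand_state`), the decayed outer-product update, and the zero-padding growth rule are all hypothetical simplifications chosen for illustration; in particular, `update_state` is not RWKV-7's actual state update.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Return m @ x for a row-major matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def self_state_encode(x: Vector, state: Matrix, n_rows: int) -> Vector:
    """Hypothetical SSE: reuse the first n_rows of the state as weights W.

    Computes z = W x, then y = W^T z. Everything is linear: there is no
    softmax and no trainable matrix beyond the state itself.
    """
    w = state[:n_rows]
    z = matvec(w, x)
    return matvec(transpose(w), z)


def update_state(state: Matrix, k: Vector, v: Vector, decay: float) -> Matrix:
    """Simplified recurrent update S <- decay * S + v k^T.

    This is an assumption for illustration, not RWKV-7's exact rule.
    """
    return [
        [decay * s + vi * kj for s, kj in zip(row, k)]
        for row, vi in zip(state, v)
    ]


def expand_state(state: Matrix, new_dim: int) -> Matrix:
    """Grow a square state to new_dim x new_dim, zero-padding new entries."""
    old_dim = len(state)
    pad = new_dim - old_dim
    grown = [row + [0.0] * pad for row in state]
    grown += [[0.0] * new_dim for _ in range(pad)]
    return grown


# One token step: update the state, then encode an input against it.
state = [[0.0, 0.0], [0.0, 0.0]]
state = update_state(state, k=[1.0, 0.0], v=[0.5, 2.0], decay=0.9)
y_small = self_state_encode([1.0, 1.0], state, n_rows=1)

# Grow the state; existing entries are reused unchanged.
big_state = expand_state(state, 3)
y_big = self_state_encode([1.0, 1.0, 0.0], big_state, n_rows=1)
```

Under these assumptions, zero-padding the state leaves the encoding of the original coordinates unchanged (`y_big[:2]` equals `y_small`), which is one simple way to picture how a state could be expanded while reusing what was already learned.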

Paper Structure

This paper contains 23 sections, 9 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Architecture of the proposed RWKV-7 model with the Meta-State layer. Input tokens are processed through normalization (Norm) and state updates, with token shifting and Weighted Key-Value (WKV) mechanisms handling context. The Meta-State layer integrates a Self-State Encoder (SSE) to encode input-state interactions using a portion of the WKV state as transformation weights, evolving the state autoregressively to produce output tokens. The design ensures efficient token-parameter interactions and scalability within a state-based framework.
  • Figure 2: Cross-entropy loss comparison between our proposed Meta-State models (RWKV-7 with Meta-State layer) and Transformer baselines on the Pile test set across different model sizes (150M, 450M, 900M, 1.5B parameters). Our Meta-State models consistently achieve lower loss across all sizes, demonstrating superior efficiency and scalability.