BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling

Li weile, Liu Xiao

TL;DR

The paper addresses the challenge of scaling large time-series models by replacing Timer's Transformer backbone with RWKV-7, which incorporates time mix and channel mix and is implemented in an implicit DEQ framework to enable effectively infinite-depth recurrence. The authors demonstrate that a 1.6M-parameter Rimer model can match or exceed the performance of a 37.8M-parameter Timer model, with training-time speedups of up to ≈4.5× across multiple datasets and strong cross-hardware compatibility via Triton on ROCm platforms. Key contributions include revisiting RWKV-7 for time series, integrating it into a Transformer-based architecture, and presenting DEQ-based implicit layers for efficiency, with public code and weights. The work positions RWKV-7 as a practical, scalable alternative for large-scale time-series forecasting, with significant gains in efficiency and robustness.

Abstract

Time series models face significant challenges in scaling to handle large and complex datasets, akin to the scaling achieved by large language models (LLMs). The unique characteristics of time series data and the computational demands of model scaling necessitate innovative approaches. While researchers have explored various architectures such as Transformers, LSTMs, and GRUs to address these challenges, we propose a novel solution using RWKV-7, which incorporates meta-learning into its state update mechanism. By integrating RWKV-7's time mix and channel mix components into the Transformer-based time series model Timer, we achieve a substantial performance improvement of approximately 1.13x to 43.3x and a 4.5x reduction in training time while using only 1/23 of the parameters. Our code and model weights are publicly available for further research and development at https://github.com/Alic-Li/BlackGoose_Rimer.

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The benchmarks reveal that Rimer, with a significantly reduced parameter count of just 1.6 million, consistently outperforms or matches the performance of Timer, which relies on a much larger 37.8 million parameters, across multiple metrics.
  • Figure 2: The RWKV-7 architecture is an RNN model that processes sequences using repeated RWKV blocks, each containing: (1) a time mix block to blend current and past information; (2) WKV heads for attention-like processing with an internal state to maintain memory; (3) a channel mix block to transform the data further.
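
The block structure described in the Figure 2 caption can be illustrated with a toy sketch in pure Python. This is not the authors' implementation. The identity projections, the fixed `decay` and `mu` values, the single head, and the decayed outer-product state update are all simplifying assumptions standing in for RWKV-7's learned projections and its more elaborate state update rule. Normalization layers are also omitted. The sketch only shows how a time mix step, a stateful attention-like memory, and a channel mix step fit together in a recurrent loop.

```python
# Toy RWKV-style block in pure Python. Illustrative only: the decayed
# outer-product state update is a simplified stand-in, NOT the actual
# RWKV-7 update rule, and all projections are assumed to be identity.

def lerp(a, b, mu):
    """Token shift: blend the current vector a with the previous vector b."""
    return [(1 - mu) * ai + mu * bi for ai, bi in zip(a, b)]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def time_mix(x, x_prev, state, decay=0.9, mu=0.5):
    """Blend current and past input, update the memory state, read from it."""
    xs = lerp(x, x_prev, mu)
    r, k, v = xs, xs, xs  # assumption: identity projections for brevity
    # state[i][j] <- decay * state[i][j] + v[i] * k[j]
    new_state = [[decay * s + vi * kj for s, kj in zip(row, k)]
                 for row, vi in zip(state, v)]
    out = matvec(new_state, r)  # read the memory with the receptance vector
    return out, new_state

def channel_mix(x, x_prev, mu=0.5):
    """Position-wise transform: squared ReLU on the token-shifted input."""
    xs = lerp(x, x_prev, mu)
    return [max(0.0, xi) ** 2 for xi in xs]

def rwkv_block(seq, d):
    """Run one block over a sequence, carrying the state between steps."""
    state = [[0.0] * d for _ in range(d)]
    prev_in = [0.0] * d
    prev_mid = [0.0] * d
    outputs = []
    for x in seq:
        tm, state = time_mix(x, prev_in, state)
        mid = [xi + ti for xi, ti in zip(x, tm)]    # residual connection
        cm = channel_mix(mid, prev_mid)
        out = [mi + ci for mi, ci in zip(mid, cm)]  # residual connection
        prev_in, prev_mid = x, mid
        outputs.append(out)
    return outputs, state

outputs, final_state = rwkv_block([[1.0, 0.0], [0.0, 1.0]], d=2)
```

In this sketch the state is a d × d matrix whose size does not depend on the sequence length, so each step costs the same amount of memory and compute no matter how many time steps came before.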