State Tuning: State-based Test-Time Scaling on RWKV-7

Liu Xiao, Li Zhiyuan, Lin Yueyu

TL;DR

The paper tackles the challenge of making small recurrent models such as RWKV-7 competitive with larger LMs by tuning internal states instead of retraining weights. It introduces four state-tuning strategies: Standard State Tuning, Dynamic Scaling with kernel-based state upscaling, DBP-Enhanced Dynamic State Tuning, and Test-Time Scaling guided by a larger LM using reinforcement learning and chain-of-thought reasoning. Empirical results on MMLU, GSM8K, WinoGrande, and ARC-C show that all four methods surpass the vanilla RWKV-7 baseline, with DBP achieving the strongest overall gains and test-time scaling providing flexibility at inference time. These methods offer scalable, resource-efficient ways to narrow the gap between small RWKV-7 models and larger models, enabling task-specific adaptation without retraining.

Abstract

Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.
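
To make the core idea concrete (optimizing only a recurrent state while the pre-trained weights stay frozen), the following is a minimal toy sketch. It is not RWKV-7 and not the authors' implementation: a scalar linear recurrence stands in for the model, and the names `run` and `tune_initial_state`, the constants `a`, `b`, `c`, and the squared-error objective are all assumptions made for illustration.

```python
# Toy illustration only: a scalar linear recurrence stands in for a
# recurrent language model. Hypothetical names and constants throughout.

def run(h0, xs, a=0.9, b=0.5, c=2.0):
    """Frozen 'model': h_t = a * h_{t-1} + b * x_t, output y = c * h_T."""
    h = h0
    for x in xs:
        h = a * h + b * x
    return c * h


def tune_initial_state(xs, target, steps=200, lr=0.1, a=0.9, b=0.5, c=2.0):
    """Gradient descent on the initial state h0 only.

    The 'weights' a, b, c are never updated; only h0 changes.
    """
    h0 = 0.0
    # For this linear recurrence, dy/dh0 = c * a**T in closed form.
    dy_dh0 = c * a ** len(xs)
    for _ in range(steps):
        err = run(h0, xs, a, b, c) - target
        # Gradient of the squared error (y - target)**2 with respect to h0.
        h0 -= lr * 2.0 * err * dy_dh0
    return h0


if __name__ == "__main__":
    xs = [1.0, -0.5, 2.0]
    target = 3.0
    h0 = tune_initial_state(xs, target)
    print("untuned output:", run(0.0, xs))
    print("tuned output:  ", run(h0, xs))
```

In this sketch the tuned output approaches the target even though `a`, `b`, and `c` are untouched, which mirrors in miniature the claim that task adaptation can be pushed into the state. The kernel-based upscaling and DBP steps described in the abstract are not modeled here.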

Paper Structure

This paper contains 24 sections, 6 equations, 1 table.