Video RWKV: Video Action Recognition Based RWKV

Zhuowen Yin, Chengru Li, Xingbo Dong

TL;DR

An LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task, incorporating a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features.

Abstract

To address the challenges of high computational costs and long-distance dependencies in existing video understanding methods, such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose an LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features, enhancing the focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as a forget gate for the LSTM, guiding long-term memory management. A tube masking strategy reduces redundant information in video and reduces overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
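
The abstract names a tube masking strategy without spelling it out. As commonly used in masked video modeling, tube masking samples one set of spatial patch positions and masks those same positions in every frame, so each masked position forms a "tube" through time. The sketch below illustrates that general idea only; the function name `tube_mask`, the patch grid size, and the 0.75 mask ratio are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=0):
    """Return a boolean mask of shape (num_frames, grid_h, grid_w).

    True marks a masked patch. The same spatial positions are masked in
    every frame. All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Sample the masked spatial positions once, without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[masked_idx] = True
    spatial = spatial.reshape(grid_h, grid_w)

    # Repeat the spatial mask along time so every masked patch is a tube.
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()

# Hypothetical settings: 8 frames, a 14x14 patch grid, 75% of patches masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.75)
```

Because the mask is identical across frames, a model cannot recover a masked patch by copying the same location from a neighboring frame, which is the kind of temporal redundancy the abstract refers to.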

Paper Structure

This paper contains 12 sections, 12 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the LSTM CrossRWKV working pipeline. The input frame $X_{t}$, hidden state $H_{t}$, and cell state $C_{t}$ jointly determine the output $O_{t}$. The hidden state and cell state contain memory information that can be transferred to the next frame.
  • Figure 2: The framework of LSTM CrossRWKV. We use frame-by-frame analysis for video recognition, which effectively reduces the memory footprint and improves the inference speed.
  • Figure 3: LCR. Figure (3a) shows a recurrent unit in our framework. Figure (3b) illustrates how Cross RWKV processes both the current input $X_{d}$ and the current edge feature $X_{e}$.
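
The figure captions describe a recurrent, frame-by-frame pipeline in which $X_{t}$, $H_{t}$, and $C_{t}$ determine $O_{t}$, and the abstract says edge information acts as the LSTM forget gate. The toy sketch below shows only that data flow. It is not the authors' LCR unit: it has no learned weights, it replaces the Cross RWKV gate with plain element-wise gates, and it uses a gradient-magnitude map as a stand-in for the edge feature $X_{e}$. The names `edge_map` and `recurrent_step` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_map(frame):
    """Gradient-magnitude edge feature scaled to [0, 1] (assumed stand-in for X_e)."""
    gy, gx = np.gradient(frame)
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

def recurrent_step(x_t, h_prev, c_prev):
    """One weight-free LSTM-style update; every gate form here is an assumption."""
    x_e = edge_map(x_t)
    f_t = sigmoid(x_e + h_prev)      # forget gate driven by edge information
    i_t = sigmoid(x_t + h_prev)      # input gate
    g_t = np.tanh(x_t + h_prev)      # candidate memory
    c_t = f_t * c_prev + i_t * g_t   # cell state carries long-term memory
    o_t = sigmoid(x_t + h_prev) * np.tanh(c_t)
    h_t = o_t                        # hidden state passed to the next frame
    return o_t, h_t, c_t

# Process a synthetic 5-frame, 16x16 grayscale clip one frame at a time.
video = np.random.default_rng(0).random((5, 16, 16))
h = np.zeros((16, 16))
c = np.zeros((16, 16))
for x_t in video:
    o, h, c = recurrent_step(x_t, h, c)
```

Only one frame plus the two state arrays are held at any step, so memory use in this sketch does not grow with the number of frames, which is consistent with the reduced memory footprint described for frame-by-frame analysis.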