PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

Yanbin Hao, Diansong Zhou, Zhicai Wang, Chong-Wah Ngo, Meng Wang

TL;DR

This work introduces PosMLP-Video, a lightweight MLP-like backbone for video recognition that replaces dense self-attention with Learnable Relative Position Encoding (LRPE), modeling pairwise token relations with small, channel-grouped bias dictionaries. It defines three positional gating units, temporal (PoTGU), spatial (PoSGU), and spatio-temporal (PoSTGU), and organizes them into factorized blocks for efficient spatio-temporal modeling with a complexity near $O(N)$ per relation; a four-stage architecture with windowed processing further reduces computation. Across video recognition benchmarks (e.g., Something-Something V1/V2, Kinetics-400) and tasks such as action detection and micro-expression recognition, PosMLP-Video delivers competitive or superior accuracy with substantially fewer parameters and FLOPs than state-of-the-art Transformers and MLP-based models, aided by ImageNet1K pretraining and the LRPE design. The results point to its practicality as a lightweight, scalable video backbone that transfers well to diverse video understanding tasks.

Abstract

In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP's positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be combined into three types of spatio-temporal factorized positional MLP blocks, which decrease model complexity while maintaining good performance. Additionally, we enrich relative positional relationships by using channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared to previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400 while requiring far fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
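
To make the relative-position idea concrete, the following is a minimal sketch of a temporal gating step driven by a small bias dictionary. It is an illustration under stated assumptions, not the authors' implementation: the function names `relative_bias_matrix` and `temporal_gating`, the nested-list tensor layout, and the exact gating form (one branch multiplied elementwise by a position-mixed second branch, in the style of gMLP-like gating units) are all hypothetical choices made here for clarity.

```python
# Hypothetical sketch: names, shapes, and the gating form are assumptions,
# not taken from the released PosMLP-Video code.

def relative_bias_matrix(bias_dict, n):
    """Expand 2n-1 learnable biases into an n x n relation matrix.

    Entry (i, j) depends only on the offset i - j, so the matrix holds
    n*n relation scores but only 2n-1 free parameters.
    """
    assert len(bias_dict) == 2 * n - 1
    return [[bias_dict[i - j + n - 1] for j in range(n)] for i in range(n)]


def temporal_gating(u, v, bias_dicts):
    """Gate branch u with a position-mixed branch v along the time axis.

    u, v: nested lists of shape [T][C] (T frames, C channels).
    bias_dicts: one dictionary of 2T-1 biases per channel group, so
    different channel groups can learn different temporal relations.
    """
    t, c = len(v), len(v[0])
    groups = len(bias_dicts)
    assert c % groups == 0
    per_group = c // groups
    mats = [relative_bias_matrix(d, t) for d in bias_dicts]
    out = [[0.0] * c for _ in range(t)]
    for ch in range(c):
        m = mats[ch // per_group]  # relation matrix of this channel's group
        for i in range(t):
            mixed = sum(m[i][j] * v[j][ch] for j in range(t))
            out[i][ch] = u[i][ch] * mixed  # elementwise gate
    return out
```

Under this sketch, a dense temporal mixing matrix over 8 frames would need 64 weights, whereas each bias dictionary needs only 15; channel grouping multiplies the dictionary count by the number of groups rather than by the number of channels.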

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, and 18 tables.

Figures (8)

  • Figure 1: Positional spatial and temporal gating units.
  • Figure 2: The schema of the four factorized spatio-temporal PosMLP blocks. The channel expansion ratio $r_e$ is set to 2 and 4 in our implementation.
  • Figure 3: Overall architecture of PosMLP-Video.
  • Figure 4: Patch embedding modules.
  • Figure 5: PoTGU+PoSGU versions. "V1" is the version used in PosMLP-Video. "V2" splits the channels before feeding them into the positional units and then concatenates the outputs of PoTGU and PoSGU along the channel dimension. "V3" feeds the full feature into PoTGU and PoSGU separately and then concatenates their outputs along the channel dimension. In contrast to V3, "V4" adds the outputs of PoTGU and PoSGU elementwise.
  • ...and 3 more figures
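
The V2, V3, and V4 combinations described in the Figure 5 caption can be sketched as follows. This is a rough illustration only: the function names are hypothetical, a feature is reduced to a flat list of per-channel values, and `temporal_unit` and `spatial_unit` are assumed to be shape-preserving stand-ins for PoTGU and PoSGU. V1 is omitted because the caption does not spell out its wiring, and any projection applied after concatenation is likewise not specified there.

```python
# Hypothetical sketch of the fusion variants in the Figure 5 caption.
# A feature is a flat list of channel values; the two units are
# placeholder callables that map a channel list to one of equal length.

def fuse_v2(x, temporal_unit, spatial_unit):
    """Split channels, send one half to each unit, then concatenate."""
    half = len(x) // 2
    return temporal_unit(x[:half]) + spatial_unit(x[half:])


def fuse_v3(x, temporal_unit, spatial_unit):
    """Send the full feature to both units, then concatenate channels."""
    return temporal_unit(x) + spatial_unit(x)


def fuse_v4(x, temporal_unit, spatial_unit):
    """Send the full feature to both units, then add elementwise."""
    t_out, s_out = temporal_unit(x), spatial_unit(x)
    return [a + b for a, b in zip(t_out, s_out)]
```

In this simplified form, V2 and V4 keep the channel count unchanged, while V3 doubles it.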