VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu

TL;DR

This work identifies velocity as a key factor in skeleton-based action recognition and shows that existing methods lose robustness as action speed increases. It introduces VA-AR, a velocity-aware framework built on a Velocity-Aware Transformer that uses Mixture of Window Attention (MoWA) to adapt its temporal receptive field through multiple local and shifted windows. MoWA routes attention to specialized window experts, which lets the model select temporal granularity dynamically while reusing efficient local attention blocks. Experiments on five datasets show state-of-the-art performance, with the most pronounced gains on athletic actions, supporting the case for velocity-robust representation learning and its practical applicability.

Abstract

Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We first perform a comprehensive analysis of seven prominent action recognition methods across five widely used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods declines to varying degrees, undermining their robustness. This decline poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward on uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping the face) require larger windows to encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy that dynamically adjusts its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby improving recognition accuracy. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating its effectiveness across a broad spectrum of action recognition scenarios.
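
The idea of mixing attention computed over several temporal window sizes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window sizes `(2, 4, 8)`, the linear gate `gate_weights`, and the use of identity query/key/value projections are all assumptions made to keep the example short. Each "expert" is self-attention restricted to non-overlapping windows of one size, and a per-frame softmax gate blends the expert outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window):
    """Self-attention restricted to non-overlapping temporal windows.

    x: (T, C) per-frame features; T must be divisible by `window`.
    Assumption: queries, keys and values are all x (no learned projections).
    """
    T, C = x.shape
    w = x.reshape(T // window, window, C)             # (num_windows, window, C)
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(C)    # (num_windows, window, window)
    out = softmax(scores, axis=-1) @ w                # attend within each window only
    return out.reshape(T, C)

def mixture_of_window_attention(x, gate_weights, windows=(2, 4, 8)):
    """Blend window-attention experts with a per-frame softmax gate.

    gate_weights: (C, len(windows)), a hypothetical linear gate.
    Returns the mixed features (T, C) and the gate weights (T, K).
    """
    gates = softmax(x @ gate_weights, axis=-1)                             # (T, K)
    experts = np.stack([window_attention(x, w) for w in windows], axis=1)  # (T, K, C)
    mixed = (gates[..., None] * experts).sum(axis=1)                       # (T, C)
    return mixed, gates

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))           # 16 frames, 8 channels
gate_weights = rng.normal(size=(8, 3)) # one gate logit per window size
mixed, gates = mixture_of_window_attention(x, gate_weights)
```

In this sketch, a frame whose gate puts most of its weight on the smallest window attends mainly to its immediate neighbours, which matches the intuition stated above for fast actions; a frame weighted toward the largest window draws on broader temporal context.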

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Example videos demonstrating actions at diverse velocities. (b) Performance-velocity curves for seven prominent action recognition methods and our proposed VA-AR, evaluated across five widely-used datasets.
  • Figure 2: (a) The architecture of VA-AR. The Velocity-Aware Transformer incorporates the Mixture of Window Attention (MoWA), implemented as Multi-scale Local Window Attention (MLWA) and Multi-scale Shifted Window Attention (MSWA). MoWA dynamically adjusts the attention window weights across scales, allowing the model to adapt to changes in action velocity. (b) Different window partitioning strategies, shown with a window size of 4.
  • Figure 3: Visualization of velocity-time curves and the corresponding weights generated by our Mixture of Window Attention (MoWA) for each frame across varying velocities. Note that the velocity scales differ across samples. (Right column: slow actions; Middle column: middle-speed actions; Left column: fast actions)
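
The window partitioning in Figure 2(b) and the velocity-time curves in Figure 3 can be made concrete with a small sketch. Two assumptions apply, since neither detail is specified in the captions: the shifted partition is taken to be a cyclic shift by half a window, and per-frame velocity is taken to be the mean joint displacement between consecutive frames. The function names are hypothetical.

```python
import numpy as np

def local_windows(num_frames, window):
    # Frame indices grouped into consecutive, non-overlapping windows.
    return np.arange(num_frames).reshape(-1, window)

def shifted_windows(num_frames, window):
    # Assumption: cyclic shift by half a window before partitioning,
    # so that each shifted window straddles a boundary of the local partition.
    shift = window // 2
    return np.roll(np.arange(num_frames), -shift).reshape(-1, window)

def frame_velocity(joints):
    """Assumed velocity measure: mean joint displacement per frame step.

    joints: (T, J, 3) skeleton coordinates; returns an array of shape (T - 1,).
    """
    step = np.diff(joints, axis=0)                   # (T - 1, J, 3)
    return np.linalg.norm(step, axis=-1).mean(axis=-1)

print(local_windows(8, 4))    # [[0 1 2 3] [4 5 6 7]]
print(shifted_windows(8, 4))  # [[2 3 4 5] [6 7 0 1]]
```

With a window size of 4, frames 3 and 4 fall in different local windows but share a shifted window, which is how alternating the two partitions lets information cross window boundaries.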