Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

TL;DR

This work addresses the challenge of background distractions in wide-angle, few-shot action recognition by introducing Otter, a two-module system that emphasizes subjects and restores temporal relations. The Compound Segmentation Module (CSM) highlights relevant patches before feature extraction, while the Temporal Reconstruction Module (TRM) enables bidirectional temporal scanning to reconstruct degraded sequence relations, complemented by a regular prototype. Through extensive experiments on SSv2, Kinetics, UCF101, HMDB51, and VideoBadminton, Otter achieves state-of-the-art results and demonstrates robustness to varying FoV and real-world noise. The approach offers practical gains for wide-angle FSAR and provides insightful CAM analyses illustrating improved subject-centric focus.

Abstract

Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of background distractions. Receptance Weighted Key Value (RWKV), which learns interactions between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruction of temporal relations. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

Paper Structure

This paper contains 55 sections, 21 equations, 12 figures, 17 tables, 1 algorithm.

Figures (12)

  • Figure 1: Smaller subject proportion (red circles) and degraded temporal relation (red dotted lines) both contribute to background distractions in wide-angle FSAR. As a result, wide-angle samples are more challenging to recognize compared with regular samples.
  • Figure 2: The overall architecture of Otter. The main components, CSM and TRM, are specific combinations of core units (§ \ref{['core units']}). Specifically: ① Motion Segmentation with CSM and backbone (§ \ref{['MS']}). ② Prototype 1 Construction with TRM for reconstructing temporal relations (§ \ref{['P1']}). ③ Prototype 2 Construction with a regular prototype (§ \ref{['P1']}). ④ Training Objective: $\mathcal{L}_{\text{total}}$ combines the cross-entropy loss $\mathcal{L}_{\text{ce}}$, $\mathcal{L}^{1}_{\text{P}}$ from ②, and $\mathcal{L}^{2}_{\text{P}}$ from ③ (§ \ref{['training objective']}). Notation: Ⓐ/Ⓐ: averaging/weighted averaging. +/+: element-wise plus/weighted element-wise plus.
  • Figure 3: Core units of RWKV. Ⓝ: normalization, $\times$/$\cdot$: matrix/element-wise multiplication, $\sigma$: activation function.
  • Figure 4: The structure of Compound Segmentation Module (CSM).
  • Figure 5: The structure of Temporal Reconstruction Module (TRM). Ⓞ: ordered scanning. Ⓡ: reversed scanning.
  • ...and 7 more figures
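
The captions of Figures 2 and 5 describe three ideas that are easy to illustrate: scanning frame features in both temporal orders, combining a temporal-enhanced prototype with a regular (averaged) prototype, and summing three loss terms into $\mathcal{L}_{\text{total}}$. The following is a minimal sketch of those ideas only, not the authors' implementation. The exponential running average stands in for the RWKV-based scan, and every function name, the `decay` parameter, and the weights `weight`, `w1`, and `w2` are hypothetical choices made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def causal_scan(frames: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Running exponential average over frames (assumed stand-in for a recurrent scan)."""
    state = [0.0] * len(frames[0])
    outputs = []
    for frame in frames:
        # Each output mixes the current frame with everything seen before it.
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, frame)]
        outputs.append(state)
    return outputs


def bidirectional_scan(frames: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Average an ordered scan and a reversed scan so each frame sees both directions."""
    ordered = causal_scan(frames, decay)
    # Scan the reversed sequence, then flip the result back to the original order.
    reversed_scan = causal_scan(frames[::-1], decay)[::-1]
    return [
        [(a + b) / 2.0 for a, b in zip(fwd, bwd)]
        for fwd, bwd in zip(ordered, reversed_scan)
    ]


def mean_prototype(frames: Sequence[Vector]) -> Vector:
    """Regular prototype: plain average of frame features."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def combine_prototypes(p1: Vector, p2: Vector, weight: float = 0.5) -> Vector:
    """Weighted element-wise sum of two prototypes (weight is a hypothetical parameter)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p1, p2)]


def total_loss(l_ce: float, l_p1: float, l_p2: float,
               w1: float = 1.0, w2: float = 1.0) -> float:
    """Sum of cross-entropy and two prototype losses (w1, w2 are assumed weights)."""
    return l_ce + w1 * l_p1 + w2 * l_p2


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # two toy frames with two features each
    enhanced = bidirectional_scan(frames)
    prototype = combine_prototypes(mean_prototype(enhanced), mean_prototype(frames))
    print(enhanced, prototype, total_loss(1.0, 0.5, 0.25))
```

Averaging the two scan directions is only one possible way to merge them; the source specifies bidirectional scanning but this section does not say how the two directions are fused.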