SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao

TL;DR

SeFAR tackles semi-supervised fine-grained action recognition by combining dual-level temporal elements with moderate temporal perturbation within a FixMatch-style mean-teacher framework, and employs adaptive regulation to stabilize learning. It introduces a dual-level temporal representation with local fine-grained elements and a broader context, plus a temporal perturbation augmentation that preserves action directionality. The training objective integrates supervised and unsupervised signals via $\mathcal{L} = \mathcal{L}_{sup} + \xi(\eta \mathcal{L}_{un} + \eta' \mathcal{L}_{mix})$, with adaptive coefficients $\eta$ and $\eta'$ driven by teacher-prediction statistics. Empirically, SeFAR achieves state-of-the-art results on FineGym and FineDiving and improves performance on UCF101 and HMDB51, while ablations validate the contributions of each component and demonstrate potential for enhancing multimodal foundation model understanding of fine-grained semantics.
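As a minimal sketch of how the combined objective above weights its terms, the snippet below evaluates it for scalar loss values. The function name `total_loss` and its argument names are illustrative assumptions, not identifiers from the paper; in practice each loss would be a tensor produced by the training framework.

```python
def total_loss(l_sup, l_un, l_mix, xi, eta, eta_prime):
    """Combine the losses as L = L_sup + xi * (eta * L_un + eta' * L_mix).

    All names are hypothetical placeholders for this sketch:
      l_sup     -- supervised loss on labeled samples
      l_un      -- unsupervised consistency loss on unlabeled samples
      l_mix     -- the additional unsupervised term denoted L_mix
      xi        -- global weight on the unsupervised part
      eta       -- adaptive coefficient applied to l_un
      eta_prime -- adaptive coefficient applied to l_mix
    """
    unsupervised = eta * l_un + eta_prime * l_mix
    return l_sup + xi * unsupervised


if __name__ == "__main__":
    # With xi = 0 the objective reduces to the supervised loss alone.
    print(total_loss(1.0, 2.0, 4.0, xi=0.0, eta=0.5, eta_prime=0.25))
    print(total_loss(1.0, 2.0, 4.0, xi=0.5, eta=0.5, eta_prime=0.25))
```

Because both adaptive coefficients sit inside the term scaled by $\xi$, lowering them when the teacher is unreliable shrinks the whole unsupervised contribution without touching the supervised term.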

Abstract

Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.

Paper Structure (43 sections, 8 equations, 9 figures, 8 tables)

This paper contains 43 sections, 8 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Fine-grained Action Instances. The two samples are drawn from the FineGym [shao2020finegym] dataset, specifically the "pike sole circle backward with 0.5 turn to handstand" at the top and the "... 1 turn ..." at the bottom. We further test popular MLLMs on the bottom instance at both the coarse-grained and fine-grained levels: GPT-4V [openai2024gpt4vsystemcard], VideoChat2 [li2024mvbench], VideoLLaVA [lin2023video], and InternLM-XComposer-2.5 [internlmxcomposer2_5].
  • Figure 2: Overview of the SeFAR pipeline. We target semi-supervised FAR, assuming most input samples are unlabeled. During unsupervised learning, SeFAR models dual-level temporal elements and augments each sample in two ways ('Weak' vs. 'Strong'). Samples strongly augmented (distorted) by moderate temporal perturbation are fed to the student model, while the teacher model produces pseudo-labels from weakly augmented samples. Consistency is enforced by minimizing the unsupervised loss ($\mathcal{L}_{un}$), which is further adjusted by our proposed Adaptive Regulation. The framework is trained with a weighted combination of supervised $\mathcal{L}_{sup}$ and unsupervised $\mathcal{L}_{un}$ losses.
  • Figure 3: (a) For $K$ unlabeled videos, the Teacher model predicts each video multiple times to capture the distribution of predictions, which shows less variability on coarse-grained data and more on fine-grained data. An adaptive coefficient $\eta$ is calculated from the mean and variance of the distribution to stabilize training. (b) MLLM construction pipeline with SeFAR's fine-grained features.
  • Figure 4: Ablation Studies. We compare SeFAR-B with different sampling combinations on Gym-99 5%, as illustrated on the left. We also contrast fixed threshold methods with our Adaptive Regulation strategy on FineDiving 5% in the middle. On the right side, we demonstrate the fluctuation of predictions made by the Teacher model across different datasets.
  • Figure 5: The relationship between the Teacher model's prediction accuracy and its confidence (left), as well as its standard deviation (right).
  • ...and 4 more figures
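
The caption of Figure 3(a) describes computing an adaptive coefficient $\eta$ from the mean and variance of repeated Teacher predictions. The sketch below illustrates that idea only. The exact formula is not given in this summary, so the mapping `mean * exp(-variance)`, the function name `adaptive_coefficient`, and the use of per-pass confidence scores as input are all assumptions made for illustration.

```python
import math
from statistics import fmean, pvariance


def adaptive_coefficient(confidences):
    """Return a coefficient that shrinks as teacher predictions fluctuate.

    `confidences` is a list with one entry per unlabeled video; each entry is
    a list of the teacher's confidence scores from repeated predictions on
    that video. This input format is an assumption of the sketch.
    """
    # Pool every repeated prediction into one distribution.
    flat = [c for video in confidences for c in video]
    mean = fmean(flat)
    variance = pvariance(flat)
    # Hypothetical mapping, NOT the paper's formula: high mean confidence
    # raises the coefficient, high variance lowers it.
    return mean * math.exp(-variance)


if __name__ == "__main__":
    stable = [[0.8, 0.8], [0.8, 0.8]]    # low variability
    unstable = [[0.6, 1.0], [1.0, 0.6]]  # same mean, more variability
    print(adaptive_coefficient(stable))
    print(adaptive_coefficient(unstable))
```

Under this assumed mapping, two batches with the same mean confidence receive different weights: the batch with more variable predictions contributes less to the unsupervised loss, which matches the stabilizing role the caption attributes to $\eta$.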