The Role of Video Generation in Enhancing Data-Limited Action Understanding

Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou

TL;DR

This work tackles data scarcity in video action understanding by proposing a data-bridging framework that uses a text-to-video diffusion transformer to generate annotated training data. It introduces two key innovations: an information enhancement strategy that enriches synthetic video content, and uncertainty-based label smoothing that reduces the influence of low-quality samples. Through extensive experiments on four datasets across five tasks, the approach achieves state-of-the-art zero-shot action recognition and broad improvements in few-shot, base-to-novel, long-tail, and abnormal action detection. The results demonstrate that diffusion-based synthetic data, when properly enhanced and regulated, can effectively supplement real data and enable scalable, data-efficient video understanding in practical settings.

Abstract

Video action understanding tasks in real-world scenarios often suffer from data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data at a virtually unlimited scale without human intervention. We propose an information enhancement strategy and uncertainty-based label smoothing, both tailored to training on generated samples. Through quantitative and qualitative analysis, we observe that real samples generally contain richer information than generated samples. Based on this observation, the information enhancement strategy enriches the informative content of the generated samples in two aspects: the environments and the characters. Furthermore, we observe that some low-quality generated samples can negatively affect model training. To address this, we devise the uncertainty-based label smoothing strategy, which applies stronger smoothing to these samples and thus reduces their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
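
The idea behind uncertainty-based label smoothing can be illustrated with a minimal sketch. This section does not give the paper's uncertainty estimate or its smoothing schedule, so everything below is an assumption made for illustration: uncertainty is taken to be the normalized predictive entropy of some reference classifier on a generated video, the smoothing factor grows linearly with it, and the names `normalized_entropy`, `smoothing_factor`, `eps_min`, and `eps_max` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Entropy divided by log(K), so the result lies in [0, 1]. Needs K >= 2."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def smoothing_factor(uncertainty, eps_min=0.0, eps_max=0.3):
    """Assumed linear map: higher uncertainty gives a larger smoothing factor."""
    return eps_min + (eps_max - eps_min) * uncertainty

def smoothed_target(label, num_classes, eps):
    """Standard label smoothing: mix the one-hot target with a uniform one."""
    target = [eps / num_classes] * num_classes
    target[label] += 1.0 - eps
    return target

def smoothed_cross_entropy(logits, label, eps):
    """Cross-entropy between the smoothed target and the model prediction."""
    probs = softmax(logits)
    target = smoothed_target(label, len(logits), eps)
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, probs))

# Hypothetical reference-classifier logits for two generated videos.
clean_logits = [8.0, 0.0, 0.0]   # confident prediction, low uncertainty
noisy_logits = [0.1, 0.0, 0.0]   # near-uniform prediction, high uncertainty

eps_clean = smoothing_factor(normalized_entropy(softmax(clean_logits)))
eps_noisy = smoothing_factor(normalized_entropy(softmax(noisy_logits)))
loss_noisy = smoothed_cross_entropy(noisy_logits, 0, eps_noisy)
```

Under these assumptions the ambiguous sample receives a larger smoothing factor than the confident one, so its label pulls the model less strongly toward a single class, which is the effect the abstract describes.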

Paper Structure

This paper contains 20 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: The samples (a) and t-SNE visualizations (b) of the synthetic dataset and the real dataset. From left to right, they are: the HMDB-51 dataset, the synthetic HMDB-51 dataset with the basic strategy and the synthetic HMDB-51 dataset with our proposed information enhancement strategy. (c) Unsatisfactory synthetic videos.
  • Figure 2: (a): The overall structure of our proposed method. We designed two strategies for training on generated samples: the information enhancement strategy (left) and the uncertainty-based label smoothing strategy (right). (b): The process of generating action description information through the proposed information enhancement strategy. (c): Uncertainty-based label smoothing applies stronger smoothing to low-quality generated samples with higher uncertainty.
  • Figure 3: Visualization of generated samples. The strategies adopted from left to right are: "Basic", "Env", "Cha" and "IE".
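
The four strategy names in Figure 3 suggest how the prompts for the text-to-video model might differ. The sketch below is only an illustration under stated assumptions: it reads "Env" as adding an environment description, "Cha" as adding a character description, and "IE" as adding both, based on the two aspects named in the abstract. The prompt template, the example descriptions, and the functions `build_prompt` and `prompt_variants` are hypothetical. How the paper actually obtains the descriptions is not covered in this section.

```python
def build_prompt(action, environment=None, character=None):
    """Compose a text-to-video prompt from an action label and optional details."""
    subject = character if character else "a person"
    prompt = f"{subject} performing the action: {action}"
    if environment:
        prompt += f", {environment}"
    return prompt

def prompt_variants(action, environment, character):
    """Return one prompt per strategy name used in Figure 3."""
    return {
        "Basic": build_prompt(action),
        "Env": build_prompt(action, environment=environment),
        "Cha": build_prompt(action, character=character),
        "IE": build_prompt(action, environment=environment, character=character),
    }

# Hypothetical descriptions for one action class.
variants = prompt_variants(
    action="brush hair",
    environment="in a sunlit bathroom",
    character="a young woman",
)
for name, prompt in variants.items():
    print(f"{name}: {prompt}")
```

Because each generated video inherits the action label used in its prompt, every variant yields an annotated sample without manual labeling. Only the amount of scene and character detail changes.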