Visual Self-paced Iterative Learning for Unsupervised Temporal Action Localization

Yupeng Hu, Han Jiang, Hao Liu, Kun Wang, Haoyu Tang, Liqiang Nie

TL;DR

This paper tackles unsupervised temporal action localization (UTAL), where a model must learn action boundaries without any temporal annotations. It introduces FEEL, a self-paced iterative learning framework that jointly improves clustering confidence and localization training through two modules, Clustering Confidence Improvement (CCI) and Self-paced Incremental Instance Selection (IIS), built on a CoLA-based localization backbone. CCI refines pseudolabels with a feature-robust Jaccard distance computed over l-reciprocal nearest neighbors, and IIS progressively expands the training set through easy-to-hard sampling, which makes training more robust to noisy pseudolabels. On THUMOS'14 and ActivityNet v1.2, FEEL achieves state-of-the-art unsupervised TAL performance; ablations and analyses confirm that CCI and IIS are complementary and show that the approach transfers to other UTAL baselines.
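To make the reciprocal-neighbor idea concrete, here is a minimal sketch of a Jaccard distance over l-reciprocal nearest-neighbor sets. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of Euclidean distance on raw feature vectors, and the choice to include each item in its own neighbor set are all hypothetical, and the paper's "feature-robust" variant may differ.

```python
from math import dist


def knn(features, i, l):
    """Indices of the l nearest neighbors of item i (excluding i itself)."""
    others = (j for j in range(len(features)) if j != i)
    ranked = sorted(others, key=lambda j: dist(features[i], features[j]))
    return set(ranked[:l])


def reciprocal_neighbors(features, i, l):
    """Neighbors j of i for which i is also among the l nearest neighbors of j.

    Assumption: item i is added to its own set so the set is never empty.
    """
    mutual = {j for j in knn(features, i, l) if i in knn(features, j, l)}
    return mutual | {i}


def jaccard_distance(features, a, b, l):
    """1 - |R(a) & R(b)| / |R(a) | R(b)| over reciprocal-neighbor sets."""
    ra = reciprocal_neighbors(features, a, l)
    rb = reciprocal_neighbors(features, b, l)
    return 1.0 - len(ra & rb) / len(ra | rb)


# Two well-separated toy groups of 2-D "video features".
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
same_group = jaccard_distance(features, 0, 1, l=2)
different_group = jaccard_distance(features, 0, 3, l=2)
```

In this toy example, two items that share the same reciprocal neighborhood get a distance of 0, while items with disjoint neighborhoods get a distance of 1. The distance depends on shared context rather than on the raw feature gap alone, which is the intuition behind using it to refine cluster assignments.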

Abstract

Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised and weakly supervised methods depend heavily on extensive annotations of temporal boundaries and action categories, which are labor-intensive and time-consuming to collect. Although some unsupervised methods have adopted the "iteratively clustering and localization" paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudolabels for model training. To address these limitations, we present a novel self-paced iterative learning model that enhances clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve clustering confidence by exploring contextual, feature-robust visual information. Thereafter, we design two incremental instance learning strategies (constant-speed and variable-speed) for easy-to-hard model training, thus ensuring the reliability of the video pseudolabels and further improving overall localization performance. Extensive experiments on two public datasets substantiate the superiority of our model over several state-of-the-art competitors.

Paper Structure

This paper contains 28 sections, 17 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Illustration of the temporal action localization task.
  • Figure 2: An illustration of our FEEL model. Starting from the initial clustering results, each iteration consists of three stages: applying CCI to refine the initial clustering for pseudolabel generation; employing IIS to select the most reliable instances for localization training; and training the localization model. Within an iteration, distinct shapes distinguish different clusters. A solid dot means that the corresponding video is correctly pseudolabeled, while a hollow dot means the opposite. The red solid dots denote the clustering centers. As shown, the CCI module corrects some mislabeled videos and simultaneously pulls correctly labeled instances closer to the clustering centers while pushing erroneously labeled ones farther away. Afterward, only the videos with high labeling quality (the dots within the shaded region) are selected for model training.
  • Figure 3: Illustrations of the enlarging curves with different $I_{max}$, where the solid and dashed lines are the constant mode and variable mode, respectively.
  • Figure 4: Illustration of the CCI and IIS modules on an action cluster, where the positive videos of this cluster are marked with green rectangles. Top: The initial top-6 ranking list of a clustering center, where P1-P4 are positives and N1-N2, marked with red rectangles, are negatives. P4 is marked with $\times$ because this positive video is mislabeled as another action. Middle: Each pair of columns shows the top-6 neighbors of the corresponding video. A significant overlap exists between the top-6 neighbors of P1-P4 and those of the clustering center. Bottom: The reranked top-6 list of this cluster. Based on the IIS module, only the top-4 videos of this list, highlighted in yellow, are selected for model training.
  • Figure 5: Localization results and clustering results of our FEEL-F model w.r.t. iterations.
  • ...and 3 more figures
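
The easy-to-hard selection described in the captions for Figures 2 to 4 can be sketched as follows. This is a hedged illustration only: the captions say that a constant mode and a variable mode exist and that the enlarging curves depend on $I_{max}$, but they do not give the formulas. The linear schedule, the quadratic schedule used for the variable mode, and all function names here are assumptions made for the example.

```python
def selection_size(n_videos, iteration, i_max, mode="constant"):
    """Number of videos to keep at a given iteration (1-based)."""
    progress = min(iteration, i_max) / i_max
    if mode == "constant":
        fraction = progress          # assumed: grows by the same step each iteration
    elif mode == "variable":
        fraction = progress ** 2     # hypothetical curve: slow start, faster later
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(1, round(fraction * n_videos))


def select_instances(distances_to_center, iteration, i_max, mode="constant"):
    """Indices of the videos closest to the cluster center, easiest first."""
    k = selection_size(len(distances_to_center), iteration, i_max, mode)
    ranked = sorted(range(len(distances_to_center)),
                    key=lambda i: distances_to_center[i])
    return ranked[:k]


# Toy cluster: (reranked) distance of each video to its clustering center.
distances = [0.9, 0.1, 0.5, 0.3]
halfway = select_instances(distances, iteration=2, i_max=4)
final = select_instances(distances, iteration=4, i_max=4)
```

Early iterations train only on the videos nearest the center, whose pseudolabels are the most likely to be correct; by iteration $I_{max}$ every video in the cluster is included. Under the assumed quadratic curve, the variable mode admits fewer videos than the constant mode at the same early iteration.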