Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data

Hilde Kuehne, Ahsan Iqbal, Alexander Richard, Juergen Gall

TL;DR

The paper tackles the challenge of learning fine-grained human actions beyond handcrafted, fully supervised datasets by constructing a large webly supervised benchmark built from YouTube cooking videos. It introduces a densely labeled test set (250 videos, 512 classes) and a training set (~200k samples) mined from subtitles, all without human annotation, and organizes the actions into a hierarchical taxonomy to manage semantic noise. Through Temporal Segment Networks using pre-trained features, the study analyzes multiple data sources, mining strategies, and model bridges, showing that subtitle-driven data and hierarchical inference substantially improve weakly supervised action recognition over baselines and plain web-crawled data. The work demonstrates scalable, reproducible methods for learning action concepts from noisy, real-world video, with significant implications for deploying action recognition in the wild.

Abstract

Action recognition has so far mainly focused on classifying hand-selected, pre-clipped actions, and has reached impressive results in this setting. But with performance reaching a ceiling on current datasets, it appears that the next steps in the field will have to go beyond fully supervised classification. One way to do so is to move towards less restricted scenarios. In this context, we present a large-scale real-world dataset designed to evaluate learning techniques for human action recognition beyond hand-crafted datasets. To this end, we put the process of collecting data on its feet again and start with the annotation of a test set of 250 cooking videos. The training data is then gathered by searching for the respective annotated classes within the subtitles of freely available videos. The dataset is unique in that the whole process of collecting the data and training does not involve any human intervention. To address the semantic inconsistencies that arise with this kind of training data, we further propose a semantic hierarchical structure for the mined classes.

Paper Structure

This paper contains 18 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Examples of frames and related classes from YouTube cooking videos. Similar to Malmaud et al. (malmaud15what), we follow the idea of automatically mining large-scale training data from videos and subtitles without the need for human intervention.
  • Figure 2: Examples of different mining strategies: (a) When mining neighboured components, only keywords that follow directly after one another are considered as possible class labels. (b) With ordered lists, all connections from one keyword to any later keyword are allowed, as long as the order is preserved. (c) With scrambling, all combinations of all keywords within the subtitle are allowed.
  • Figure 3: Overview of the hierarchical model for the proposed action classes.
  • Figure 4: Overview of the hierarchical model for the proposed action classes.
  • Figure 5: Qualitative examples of the alignment of two sample videos from the test set.
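
The three mining strategies described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the keywords of one subtitle have already been extracted in their order of appearance, and that a candidate class label is a pair of keywords; the function names and the pair-based label format are hypothetical and chosen only for illustration.

```python
from itertools import combinations, permutations


def mine_neighboured(keywords):
    """(a) Only keywords that directly follow one another form a candidate label."""
    return list(zip(keywords, keywords[1:]))


def mine_ordered(keywords):
    """(b) Any keyword may be paired with any later keyword; order is preserved."""
    return list(combinations(keywords, 2))


def mine_scrambled(keywords):
    """(c) Any keyword may be paired with any other keyword, in either order."""
    return list(permutations(keywords, 2))


# Hypothetical keyword sequence extracted from a single subtitle line.
keywords = ["add", "salt", "stir"]

neighboured = mine_neighboured(keywords)
ordered = mine_ordered(keywords)
scrambled = mine_scrambled(keywords)

print(neighboured)
print(ordered)
print(scrambled)
```

Under these assumptions, each strategy yields a superset of the candidates of the previous one: neighboured mining is the most restrictive, and scrambling the most permissive and therefore likely the noisiest.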