Telling Stories for Common Sense Zero-Shot Action Recognition

Shreyank N Gowda, Laura Sevilla-Lara

TL;DR

This work introduces Stories, a novel dataset of rich textual descriptions for diverse action classes, extracted from WikiHow articles. These descriptions enable modeling of nuanced relationships between actions, paving the way for zero-shot transfer.

Abstract

Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classification. Without any fine-tuning on the target dataset, our method achieves a new state of the art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this domain. The data can be found here: https://github.com/kini5gowda/Stories

Paper Structure (27 sections, 2 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Comparison of accuracy across state-of-the-art ZS approaches using different semantic embeddings: the proposed Stories, word2vec (W2V) and elaborative definitions (ER), on UCF101. Using the proposed Stories to create the semantic space of class labels improves the performance by a large margin across all methods, showing that it is model-agnostic.
  • Figure 2: Comparing nearest neighbors using Stories on UCF101. We show an example where ER fails, while Stories provides more context and yields better neighbors. This is one of multiple such examples.
  • Figure 3: Visualization of the features generated from the embedding vs ER, using t-SNE. We observe that the samples of each class instance, depicted in a single color, are better clustered together, pointing to a more semantically meaningful space.
  • Figure 4: Using Stories for feature generation. The elements depicted in yellow are the standard vanilla approach to feature generation for ZS. Depicted in green are the elements that we introduce.
  • Figure 5: Training the generator using data-driven noise converges much faster than using the standard Gaussian noise.
  • ...and 1 more figures