Human Action Co-occurrence in Lifestyle Vlogs using Graph Link Prediction

Oana Ignat, Santiago Castro, Weiji Li, Rada Mihalcea

TL;DR

This paper addresses the automatic identification of human actions that co-occur in videos by modeling actions as nodes in a co-occurrence graph and predicting links between them. It introduces the ACE/Co-Act dataset, built from lifestyle vlogs, and combines textual transcripts, visual signals, and graph topology to learn action representations. A range of baselines (random, heuristic topology-based, embedding-based, and learning-based models) shows that graph-informed, multimodal approaches perform well, with the SVM that uses all modalities achieving the highest accuracy. The learned graph-based action representations capture cross-domain relations and location cues, which improves the relevance and diversity of retrieved actions. The dataset and code are publicly released as a resource for future research.

Abstract

We introduce the task of automatic human action co-occurrence identification, i.e., determining whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at https://github.com/MichiganNLP/vlog_action_co-occurrence.
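
To make the link prediction framing concrete, the sketch below scores a candidate action pair with a simple topology heuristic, the Jaccard coefficient over graph neighbors. This is an illustration only: the summary above does not say which topology heuristics the paper uses, and the helper names `build_graph` and `jaccard_score`, as well as the toy edges, are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (action, action) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def jaccard_score(graph, u, v):
    """Share of neighbors that two actions have in common.

    A higher score suggests the pair is more likely to be linked,
    i.e. to co-occur. Hypothetical baseline, not the authors' model.
    """
    neighbors_u = graph.get(u, set())
    neighbors_v = graph.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Toy co-occurrence edges (made up for illustration).
toy_edges = [
    ("wake up", "wash face"),
    ("wake up", "brush teeth"),
    ("wash face", "apply moisturizer"),
    ("brush teeth", "apply moisturizer"),
    ("chop potato", "peel potato"),
]
toy_graph = build_graph(toy_edges)
same_routine = jaccard_score(toy_graph, "wash face", "brush teeth")
different_routine = jaccard_score(toy_graph, "wash face", "chop potato")
```

In this toy graph, "wash face" and "brush teeth" share all of their neighbors, so they score higher than a pair drawn from unrelated routines. Heuristics of this kind use graph structure only and ignore the visual and textual signals described in the abstract.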

Paper Structure (49 sections, 7 equations, 8 figures, 4 tables)

This paper contains 49 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: We draw inspiration from contextual word representations to create novel action representations based on video temporal context. Specifically, when predicting the next word in a sentence, some words are more expected than others: for instance, after "Today is", an expected word is "sunny" and not "apple". Similarly, human actions follow certain patterns: for instance, after "waking up", an expected next action is to "wash the face" and not to "clean the house".
  • Figure 2: Top three action neighbors, obtained from textual (blue) and graph-based (purple) representations, for three random action queries from our dataset: "rub stain", "build desk", "chop potato".
  • Figure 3: Co-occurrence matrix for the top 20 most frequent actions in our dataset, Co-Act. The scores are computed using the PPMI measure: actions with higher scores have a stronger co-occurrence relation and vice-versa. For better visualization, we sort the matrix rows to highlight clusters. Best viewed in color.
  • Figure 4: Co-occurrence matrix for the top 50 most frequent actions in our dataset, Co-Act. The scores are computed using the PPMI measure: actions with higher scores have a stronger co-occurrence relation and vice-versa. For better visualization, we sort the matrix rows to highlight clusters. Best viewed in color.
  • Figure 5: Co-occurrence matrix for the top 50 most frequent verbs in our dataset, Co-Act. The scores are computed using the PPMI measure: actions with higher scores have a stronger co-occurrence relation and vice-versa. For better visualization, we sort the matrix rows to highlight clusters. Best viewed in color.
  • ...and 3 more figures
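
The co-occurrence matrices in Figures 3 to 5 are scored with PPMI (positive pointwise mutual information), which compares how often two actions co-occur with how often they would co-occur by chance and clips negative values to zero. A minimal sketch of the standard PPMI definition follows. The function name `ppmi_scores`, the input format (one set of actions per time interval), and the toy data are assumptions made for illustration; the paper's exact counting scheme is not given in this summary.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_scores(intervals):
    """Compute PPMI for every pair of actions seen in the same interval.

    intervals: iterable of collections of action strings.
    Returns a dict mapping sorted (action_a, action_b) tuples to PPMI.
    """
    pair_counts = Counter()
    row_sums = Counter()  # row sums of the symmetric co-occurrence matrix
    for actions in intervals:
        for a, b in combinations(sorted(set(actions)), 2):
            pair_counts[(a, b)] += 1
            row_sums[a] += 1
            row_sums[b] += 1

    # Each unordered pair fills two cells of the symmetric matrix.
    total = 2 * sum(pair_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / total
        p_a = row_sums[a] / total
        p_b = row_sums[b] / total
        pmi = math.log2(p_ab / (p_a * p_b))
        scores[(a, b)] = max(0.0, pmi)  # keep only positive associations
    return scores


# Toy intervals (made up for illustration).
toy_intervals = (
    [{"wake up", "wash face"}] * 3
    + [{"chop potato", "peel potato"}] * 3
    + [{"wake up", "chop potato"}]
)
toy_scores = ppmi_scores(toy_intervals)
```

Here the frequent pairs receive positive scores, while ("chop potato", "wake up"), which co-occurs less often than chance would predict given how frequent both actions are, is clipped to zero.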