The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary

Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, Cuong Duc Dao

TL;DR

The paper documents the 2018 ActivityNet Challenge and its six tasks for large-scale semantic video understanding: three core tasks based on ActivityNet (temporal proposals, temporal localization, and dense captioning) and three guest tasks built on Kinetics-600, AVA, and Moments in Time. It describes each task's objective, dataset, and evaluation metrics, and reports the top submissions per task. By hosting detection/localization and captioning tasks under one benchmark, the challenge covers both structured activity evidence and natural language descriptions of internet-scale video.

Abstract

The 3rd annual installment of the ActivityNet Large-Scale Activity Recognition Challenge, held as a full-day workshop at CVPR 2018, focused on the recognition of daily life, high-level, goal-oriented activities from user-generated videos such as those found in internet video portals. The 2018 challenge hosted six diverse tasks which aimed to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the six tasks were based on the ActivityNet dataset, which was introduced at CVPR 2015 and is organized hierarchically in a semantic taxonomy. These tasks focused on tracing evidence of activities in time in the form of proposals, class labels, and captions. In this installment of the challenge, we hosted three guest tasks to enrich the understanding of visual information in videos. The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.

Paper Structure

This paper contains 9 sections, 8 tables.