Video Summarization: Towards Entity-Aware Captions

Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

TL;DR

This paper proposes a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. The method also generalizes to an existing news image captioning dataset.

Abstract

Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place, or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. We therefore propose the task of summarizing news videos directly into entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to an existing news image captioning dataset. With these extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

Paper Structure (22 sections, 4 equations, 19 figures, 15 tables)

Figures (19)

  • Figure 1: Top: In the traditional Video Captioning task, captions typically do not contain specific named entities. Bottom: In the proposed Entity-Aware Video Captioning task, the objective is to generate captions that include relevant entities, such as the highlighted names and places. This is necessary for effective captioning in the news domain. The task permits the use of external knowledge sources.
  • Figure 2: Left: Data Collection Procedure. Right: Samples from our dataset highlighting key properties: captions are rich in entities (colored green), well aligned with the video and diverse (topics shown: politics, conflict & art).
  • Figure 3: Distribution of entities in VIEWS. NORP: Nationalities or religious or political groups. GPE: geopolitical entities (countries, cities, states). ORG: organizations (companies, agencies, institutions).
  • Figure 4: Word cloud of 'PERSON' entities in VIEWS.
  • Figure 5: An overview of our proposed approach. Given the video input, we first use EP to detect entities. Then, KE uses the detected entities to extract contextual knowledge about the video. Finally, video, entities and context are input to CM to generate entity-aware video captions.
  • ...and 14 more figures
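
The three-stage flow described in the Figure 5 caption (EP detects entities, KE retrieves context from those entities, CM generates the caption from video, entities, and context) can be sketched as plain function composition. This is only an illustrative sketch, not the authors' implementation: the names `CaptionInputs`, `caption_pipeline`, and the `toy_*` stand-ins are hypothetical, and the real EP, KE, and CM components are models whose interfaces are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionInputs:
    """Everything the caption stage (CM) receives, per the Figure 5 caption."""
    video: str            # hypothetical: a path or ID standing in for video data
    entities: List[str]   # output of the entity stage (EP)
    context: List[str]    # output of the knowledge stage (KE)


def caption_pipeline(
    video: str,
    ep: Callable[[str], List[str]],
    ke: Callable[[List[str]], List[str]],
    cm: Callable[[CaptionInputs], str],
) -> str:
    """Chain the three stages in the order the caption describes."""
    entities = ep(video)      # step 1: detect entities from the video
    context = ke(entities)    # step 2: retrieve context using those entities
    return cm(CaptionInputs(video, entities, context))  # step 3: generate caption


# Toy stand-ins (assumptions for illustration only; real stages are learned models).
def toy_ep(video: str) -> List[str]:
    return ["Paris"]


def toy_ke(entities: List[str]) -> List[str]:
    return [f"background on {e}" for e in entities]


def toy_cm(inputs: CaptionInputs) -> str:
    names = ", ".join(inputs.entities)
    return (
        f"Caption for {inputs.video} mentioning {names} "
        f"({len(inputs.context)} context snippets)"
    )


caption = caption_pipeline("clip.mp4", toy_ep, toy_ke, toy_cm)
print(caption)
```

Passing the stages in as callables keeps the sketch independent of any particular model, which matches the abstract's claim that the approach is applied on top of three different video captioning models.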