Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

Shih-Han Chou; Matthew Kowal; Yasmin Niknam; Diana Moyano; Shayaan Mehdi; Richard Pito; Cheng Zhang; Ian Knopke; Sedef Akinli Kocak; Leonid Sigal; Yalda Mohsenzadeh

TL;DR

The paper targets high-level, news-domain video-language understanding by introducing ReutersViLNews, a large-scale, professionally labeled dataset of long-form event videos. It includes eight annotation types across seven topics, covering 200+ locations and 16 languages. The authors benchmark four tasks (video captioning, video paragraph generation, open-world keyword generation, and video-text retrieval) using multiple baselines, and find that current models struggle with long-form, abstract news content. Key findings: audio input aids captioning; semantic similarity metrics are needed to evaluate generated paragraphs; open-vocabulary keyword generation remains challenging; and TS2-Net leads in retrieval. The work provides a benchmark for advancing multimodal understanding in journalism and suggests directions for richer annotations and evaluation that better capture news-context understanding.

Abstract

While progress has been made in the domain of video-language understanding, current state-of-the-art algorithms are still limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. In contrast, humans easily combine information from video and language to infer information beyond what is visually observable in the pixels. An example of this is watching a news story, where the context of the event can play as big a role in understanding the story as the event itself. As a step towards building this ability into algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news. The ReutersViLNews dataset consists of long-form news videos collected and labeled by news industry professionals over several years and contains prominent news reporting from around the world. Each video covers a single story and contains action shots of the actual event, interviews with people associated with the event, footage from nearby areas, and more. The dataset contains videos from seven subject categories: disaster, finance, entertainment, health, politics, sports, and miscellaneous, with annotations ranging from high-level to low-level: title caption, visual video description, high-level story description, keywords, and location. We first present an analysis of the dataset statistics of ReutersViLNews compared to previous datasets. We then benchmark state-of-the-art approaches on four different video-language tasks. The results suggest that news-oriented videos are a substantial challenge for current video-language understanding algorithms, and we conclude by providing future directions for designing approaches to solve the ReutersViLNews dataset.
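To make the annotation layout concrete, the following is a minimal sketch of how one dataset entry might be represented in Python. It assumes the metadata fields listed in the Figure 2 caption and the seven categories named in the abstract. The class name `NewsVideoRecord`, the field names, the `video_path` field, and the helper `group_by_category` are all hypothetical and are not taken from the paper or from any released code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List

# The seven subject categories named in the abstract.
CATEGORIES = frozenset({
    "disaster", "finance", "entertainment", "health",
    "politics", "sports", "miscellaneous",
})


@dataclass
class NewsVideoRecord:
    """Hypothetical container for one single-story video and its metadata."""
    video_path: str          # assumed: location of the video file
    short_title: str         # title caption
    story: str               # high-level story description
    keywords: List[str]      # open-vocabulary keywords
    category: str            # one of CATEGORIES
    video_description: str   # visual description of the clips
    closed_caption: str      # transcript of the spoken content
    video_date: date
    location: str

    def __post_init__(self) -> None:
        # Reject categories outside the seven listed in the abstract.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def group_by_category(
    records: Iterable[NewsVideoRecord],
) -> Dict[str, List[NewsVideoRecord]]:
    """Bucket records by subject category, e.g. for per-category statistics."""
    grouped: Dict[str, List[NewsVideoRecord]] = {}
    for record in records:
        grouped.setdefault(record.category, []).append(record)
    return grouped
```

A structure like this keeps the high-level annotations (story, keywords) and the low-level ones (video description, closed caption) side by side, so each of the four benchmark tasks can select the fields it needs from the same record.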

Paper Structure (12 sections, 6 figures, 6 tables)

Introduction
Related Work
Reuters Video-Language News Dataset
Labels and Metadata
Labeling Guidelines
Dataset Statistics
Experiments
Task 1: Video Captioning
Task 2: Video Paragraph Generation
Task 3: Open World Keyword Generation
Task 4: Video-Text Retrieval
Conclusion

Figures (6)

Figure 1: Dataset Examples. Four example videos from the ReutersViLNews dataset. Frames are shown from four news categories with time increasing from left to right. Note the diversity of the events and visual information in each video clip. The stories covered are (i) Pogba an injury doubt for United ahead of Premier League clash with Arsenal; (ii) New Zealand's COVID-19 cases hit record for second time this week; (iii) Britain will have "good set" of Brexit policies ready for June EU summit - minister; (iv) 'Couldn't think of a better place' to announce baby: Prince Harry in Australia.
Figure 2: Dataset details. An example from the ReutersViLNews dataset. Each entry contains the video and its corresponding metadata: short title, story, keywords, category, video description for video clips, closed caption, video date, and location. More details about the metadata are given in the Labels and Metadata section.
Figure 3: Text Statistics. We show the composition of captions and stories across different categories.
Figure 4: Qualitative Results for Video Captioning. We show two qualitative examples from each model and each setting. As shown in the figure, BMT [Iashin, 2020] generates more relevant captions than the other models. Anecdotally, one can also see that models trained with audio input describe the video more thoroughly, as ReutersViLNews contains interviews and press conferences.
Figure 5: Qualitative Results for Video Paragraph Generation. We visualize the video and the output paragraph from the baseline [Song, 2021] and highlight the related content in green.
...and 1 more figure
