WikiVideo: Article Generation from Multiple Videos

Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme

Abstract

We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

Paper Structure

This paper contains 44 sections, 1 equation, 12 figures, 22 tables.

Figures (12)

  • Figure 1: WikiVideo introduces the task of Article Generation from Multiple Videos, which requires writing a high-level article in the style of a Wikipedia lead, given a target event ($T$), a query about that event ($Q$), and a collection of $Q$-relevant videos ($V$). All claims in the article are grounded in visual, audio, and/or OCR content of video(s) in $V$ (indicated by matching colors between text and frame borders above).
  • Figure 2: The WikiVideo curation process. (1) Sentences in Wikipedia lead sections are decomposed into subclaims. (2) Subclaims are grounded in audio, video, and/or OCR evidence. (3) Leads are rewritten to cover only the grounded information.
  • Figure 3: CAG involves an iterative exchange between (1) a VideoLLM that generates per-video summaries and (2) a reasoning model that evaluates them and produces more event-targeted prompts that are then fed back to the VideoLLM to obtain more comprehensive summaries. Finally, (3) a text-only LLM aggregates these summaries into a full article. Boxes A and B show shortened reasoning chains from the reasoner.
  • Figure 4: Prompt For Qwen 2.5 32B Claim Decomposition
  • Figure 5: The annotation interface for our subclaim grounding task. In this protocol, the left-hand side shows both versions of the Wikipedia context: the top context is the paragraph the sentence comes from, and the bottom context is the lead section of the Wikipedia article. The right-hand side shows the sentence to be decomposed and its claims. The claims from Qwen32B are prepopulated in the protocol and the rewriters edit them.
  • ...and 7 more figures
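
The exchange described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables (`video_llm`, `reasoner`, `aggregator`), the initial prompt wording, the `max_rounds` cap, and the choice to replace each summary with its refined version are all hypothetical, since the page above does not specify them.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; none of these signatures come from the paper.
VideoLLM = Callable[[str, str], str]                    # (video, prompt) -> summary
Reasoner = Callable[[str, str, str], Tuple[bool, str]]  # (event, query, summary) -> (done, refined prompt)
Aggregator = Callable[[str, str, List[str]], str]       # (event, query, summaries) -> article


def generate_article(
    videos: List[str],
    event: str,
    query: str,
    video_llm: VideoLLM,
    reasoner: Reasoner,
    aggregator: Aggregator,
    max_rounds: int = 3,  # assumed cap on reasoner/VideoLLM exchanges per video
) -> str:
    """Sketch of a CAG-style loop: summarize, critique, re-prompt, aggregate."""
    summaries: List[str] = []
    for video in videos:
        # (1) The VideoLLM produces an initial per-video summary.
        prompt = f"Summarize the video with respect to the event '{event}' and the query '{query}'."
        summary = video_llm(video, prompt)
        for _ in range(max_rounds):
            # (2) The reasoning model evaluates the summary and, if it is
            # not yet satisfied, proposes a more event-targeted prompt.
            done, refined_prompt = reasoner(event, query, summary)
            if done:
                break
            # Assumption: the refined summary replaces the previous one.
            summary = video_llm(video, refined_prompt)
        summaries.append(summary)
    # (3) A text-only LLM aggregates the per-video summaries into an article.
    return aggregator(event, query, summaries)
```

Passing the models in as plain callables keeps the sketch independent of any particular VideoLLM or reasoning-model API; in practice each would wrap whatever inference client is in use.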