DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering

Haochen Wang, Kai Hu, Liangcai Gao

TL;DR

This work introduces DocVideoQA, a new task and dataset for understanding document-centric videos that combine dense on-screen text, visuals, and speech. It presents DV-LLaMA, a multi-branch, multi-stage MLLM that progressively enhances unimodal features, aligns visual and audio content via contrastive learning, and fuses modalities through instruction-tuned QA training, achieving substantial gains over open-source baselines. The dataset comprises 1,454 videos across 23 categories (about 828 hours in total) with 154K QA pairs, and the authors state that they will release the code and data to support future research on document-centric video understanding. The approach targets multimodal reasoning in document-rich videos, with online education and remote work as the motivating scenarios.

Abstract

Remote work and online courses have become important methods of knowledge dissemination, leading to a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature text-rich images and audio that are densely packed with information closely tied to the visual content, requiring advanced multimodal understanding capabilities. However, this domain remains underexplored due to limited dataset availability and its inherent complexity. In this paper, we introduce the DocVideoQA task and dataset for the first time, comprising 1,454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs, generated both manually and via GPT, that assess models' comprehension, temporal awareness, and modality integration capabilities. We first establish a baseline using open-source MLLMs. Recognizing the challenges in modality comprehension for document-centric videos, we then present DV-LLaMA, a robust video MLLM baseline. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration. Through fine-tuning, the LLM is equipped with audio-visual capabilities, leading to significant improvements in document-centric video understanding. Extensive testing on the DocVideoQA dataset shows that DV-LLaMA significantly outperforms existing models. We will release the code and dataset to facilitate future research.
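
The abstract mentions contrastive learning between modalities without giving the loss. As a rough illustration only, the sketch below shows one common form of audio-visual contrastive alignment, a symmetric InfoNCE-style loss over paired clip embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example, and DV-LLaMA's actual objective may differ.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th visual clip should match the i-th audio clip.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual, audio, temperature=0.07):
    # visual[i] and audio[i] are hypothetical embeddings of the same clip.
    # temperature=0.07 is an assumed value, not one reported by the paper.
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    logits_v2a = [[dot(vi, aj) / temperature for aj in a] for vi in v]
    logits_a2v = [list(col) for col in zip(*logits_v2a)]  # transpose
    return 0.5 * (diagonal_cross_entropy(logits_v2a)
                  + diagonal_cross_entropy(logits_a2v))

# Toy batch of two clips with 2-dimensional embeddings.
visual_batch = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[1.0, 0.0], [0.0, 1.0]]   # pairs line up
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs are mismatched

loss_matched = symmetric_contrastive_loss(visual_batch, audio_matched)
loss_swapped = symmetric_contrastive_loss(visual_batch, audio_swapped)
```

In this toy batch the matched pairing yields a much smaller loss than the swapped pairing, which is the behavior such an objective relies on: embeddings of the same clip are pulled together while embeddings of different clips in the batch are pushed apart.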

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An illustrative example of DocVideoQA: a video on Fundraising Pushes challenges models with tasks such as information extraction, multi-page content comprehension, and visual-audio understanding, necessitating a multidimensional analysis of the video.
  • Figure 2: Overview of DV-LLaMA and its stage-wise training.