Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos

Soumya Shamarao Jahagirdar, Jayasree Saha, C V Jawahar

TL;DR

This work addresses the challenge of understanding long-form educational and news videos by introducing the Edu-News dataset and exploring prompt-based LLM techniques to extract concise, informative captions from ASR transcripts and OCR frames. It evaluates baseline retrieval approaches, showing that traditional TF-IDF-based methods leveraging full-video context can outperform zero-shot dense retrieval models like DPR and SINGULARITY on long-form content, due to context length and frame-rate constraints. The authors design domain-specific prompt templates (education vs. news) and demonstrate a dual-path video retrieval system using OCR tokens or transcripts, with multilingual query support via translation. Overall, Edu-News provides a foundation for prompt-engineered multimodal understanding of long videos and highlights the need for long-range, context-aware retrieval methods in education and news domains.
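The TF-IDF baseline and the dual-path retrieval idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy video IDs and texts, the function names (`tokenize`, `build_index`, `retrieve`), and the smoothed IDF formula are all hypothetical choices made for the example. Each video is represented by its full transcript or its full set of OCR tokens, so the whole video contributes to the match.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_index(docs):
    """Build TF-IDF vectors for a dict mapping video_id -> full text."""
    tokenized = {vid: tokenize(text) for vid, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch): terms that occur in
    # every document still keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {}
    for vid, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[vid] = {t: tf[t] * idf[t] for t in tf}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, top_k=3):
    """Rank videos by cosine similarity between the query and each video."""
    vectors, idf = index
    tf = Counter(t for t in tokenize(query) if t in idf)
    query_vec = {t: tf[t] * idf[t] for t in tf}
    scored = [(cosine(query_vec, vec), vid) for vid, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(vid, score) for score, vid in scored[:top_k]]


# Hypothetical toy data: one index per retrieval path.
transcripts = {
    "lecture_01": "today we discuss gradient descent and convex optimization",
    "news_01": "the finance minister presented the annual budget in parliament",
}
ocr_tokens = {
    "lecture_01": "gradient descent learning rate convex function",
    "news_01": "breaking news budget 2023 parliament",
}
transcript_index = build_index(transcripts)
ocr_index = build_index(ocr_tokens)

print(retrieve("gradient descent", transcript_index))
print(retrieve("budget parliament", ocr_index))
```

A query in another language would first be translated into English and then passed to `retrieve`; the translation step is left out here because it depends on an external service.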

Abstract

Learning multimodal video understanding typically relies on datasets comprising video clips paired with manually annotated captions. Collecting such annotations becomes even more challenging for long-form videos, lasting from minutes to hours, in the educational and news domains, because it requires more annotators with subject expertise. Hence, there is a need for automated solutions. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content that allows the comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies. ASR provides textual content from audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lecture and news videos. We present baseline approaches to understand their limitations on this dataset and advocate exploring prompt engineering techniques to understand long-form multimodal video datasets comprehensively.
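As a rough illustration of how ASR and OCR text might be combined into a domain-specific prompt for an LLM, consider the sketch below. Everything in it is an assumption made for the example: the template wording, the `PROMPT_TEMPLATES` and `build_prompt` names, and the word-based truncation are hypothetical and are not the prompts used by the authors. The call to the LLM itself is omitted.

```python
# Hypothetical templates, one per domain; the wording is illustrative only.
PROMPT_TEMPLATES = {
    "education": (
        "The following text comes from a lecture video.\n"
        "Transcript (ASR): {transcript}\n"
        "On-screen text (OCR): {ocr}\n"
        "Write a caption of at most {max_caption_words} words that states "
        "the topic and the key concepts taught."
    ),
    "news": (
        "The following text comes from a news video.\n"
        "Transcript (ASR): {transcript}\n"
        "On-screen text (OCR): {ocr}\n"
        "Write a caption of at most {max_caption_words} words that states "
        "the main event and the people or places involved."
    ),
}


def build_prompt(domain, transcript, ocr_tokens,
                 max_transcript_words=1500, max_caption_words=40):
    """Fill the template for `domain` with clipped ASR text and OCR tokens."""
    if domain not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown domain: {domain!r}")
    # Clip the transcript by word count so the prompt fits a context budget
    # (a crude stand-in for token-based truncation).
    clipped = " ".join(transcript.split()[:max_transcript_words])
    # OCR often repeats the same text across frames; keep the first
    # occurrence of each non-empty token and preserve the original order.
    unique_ocr = list(dict.fromkeys(t.strip() for t in ocr_tokens if t.strip()))
    return PROMPT_TEMPLATES[domain].format(
        transcript=clipped,
        ocr=", ".join(unique_ocr),
        max_caption_words=max_caption_words,
    )


prompt = build_prompt(
    "education",
    "today we discuss gradient descent " * 3,
    ["Gradient Descent", "Gradient Descent", "Learning rate"],
    max_transcript_words=5,
)
print(prompt)
```

Truncating by word count is the simplest option; a long lecture would more realistically be split into chunks that are summarized separately and then merged.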

Paper Structure

This paper contains 12 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An illustration of the diverse data found in educational and news videos. Exploring such data holds promise for advancing long video understanding tasks.
  • Figure 2: Edu-News: Dataset for video understanding on educational and news videos.
  • Figure 3: Distribution of the data investigated in this study. (a) shows the distribution of topics in NPTEL videos, which is spread uniformly across topics. Notably, the experiments outlined in this work can be extended to a broader range of topics and more diverse lecture video content. (b) and (c) show the distributions of word embeddings of OCR tokens and transcripts, respectively, in both educational and news videos. We employ clustering with k=10 clusters to visualize the diverse types of video content present in the dataset.
  • Figure 4: Word cloud representation of video content in the dataset. Word distribution within OCR tokens extracted from (a) NPTEL videos and (b) news videos, and within transcripts of (c) NPTEL videos and (d) news videos.
  • Figure 5: Pipeline Overview: Initial processing involves leveraging multimodal cues such as OCR tokens and transcripts to analyze long-range videos. Subsequently, retrieval is performed using either captions generated by ChatGPT or alternative queries. Additionally, our system accommodates queries in Indian languages.
  • ...and 5 more figures