Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Shamarao Jahagirdar, Jayasree Saha, C V Jawahar
TL;DR
This work addresses the challenge of understanding long-form educational and news videos by introducing the Edu-News dataset and exploring prompt-based LLM techniques to extract concise, informative captions from ASR transcripts and OCR text extracted from frames. It evaluates baseline retrieval approaches, showing that traditional TF-IDF-based methods leveraging full-video context can outperform zero-shot dense retrieval models like DPR and SINGULARITY on long-form content, because the dense models are limited by context length and frame-rate constraints. The authors design domain-specific prompt templates (education vs. news) and demonstrate a dual-path video retrieval system that uses either OCR tokens or transcripts, with multilingual query support via translation. Overall, Edu-News provides a foundation for prompt-engineered multimodal understanding of long videos and highlights the need for long-range, context-aware retrieval methods in the education and news domains.
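To make the TF-IDF baseline idea concrete, here is a minimal sketch of transcript-based retrieval: each video is represented by its full transcript, and a text query is matched by cosine similarity over TF-IDF weights. This is an illustration under stated assumptions, not the authors' implementation. The video IDs, the toy transcripts, the smoothed IDF formula, and the function names (`tokenize`, `build_index`, `retrieve`) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # docs maps a video ID to its full transcript text.
    tokenized = {vid: tokenize(text) for vid, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {}
    for vid, tokens in tokenized.items():
        tf = Counter(tokens)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[vid] = {t: w / norm for t, w in vec.items()}  # L2-normalized
    return idf, vectors


def retrieve(query, idf, vectors, k=1):
    # Query terms that never occur in any transcript are ignored.
    tf = Counter(t for t in tokenize(query) if t in idf)
    q = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    scores = {
        vid: sum((w / norm) * vec.get(t, 0.0) for t, w in q.items())
        for vid, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy transcripts (made up for illustration only).
docs = {
    "lecture_01": "gradient descent updates the weights of a neural network "
                  "using the gradient of the loss",
    "news_01": "the election results were announced today and the parliament "
               "will meet next week",
}
idf, vectors = build_index(docs)
top = retrieve("how does gradient descent update weights", idf, vectors)
```

Because the whole transcript contributes to each video's vector, the method has no context-length limit, which is the property the TL;DR contrasts with dense retrievers.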
Abstract
Learning multimodal video understanding typically relies on datasets of video clips paired with manually annotated captions. Annotation becomes even more challenging for long-form videos, lasting from minutes to hours, in the educational and news domains, because it requires more annotators with subject expertise. Hence, there is a need for automated solutions. Recent advances in Large Language Models (LLMs) promise to capture concise and informative content that allows comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies. ASR provides textual content from audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lecture and news videos. We present baseline approaches to understand their limitations on this dataset and advocate for exploring prompt engineering techniques to understand long-form multimodal video datasets comprehensively.
