Video Enriched Retrieval Augmented Generation Using Aligned Video Captions

Kevin Dela Rosa

TL;DR

This work tackles the challenge of incorporating large video corpora into Retrieval Augmented Generation chat systems without overloading the LLM context. It proposes aligned visual captions—temporally synchronized, text-based descriptions derived from video captions and subtitles—to serve as a scalable intermediate representation. A Panda-70M-derived dataset (≈29k videos, ≈1.5M clips) with aligned captions is curated and evaluated, showing that text-only caption signals can approach multimodal inputs in semantic quality while reducing context requirements. Retrieval experiments comparing text embeddings (e.g., Aligned Transcript) with multimodal embeddings demonstrate viable video retrieval performance at practical K values, enabling effective video-enriched RAG. The paper also outlines a reference architecture for a Video Enriched Chat Bot and discusses practical deployment considerations and avenues for future work, such as domain adaptation and richer audio cues.

Abstract

In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions describe the visual and audio content of videos in a large corpus in a textual format that is easy to reason about and to incorporate into large language model (LLM) prompts. They also typically require less multimedia content to be inserted into the multimodal LLM context window, where typical configurations can quickly fill the context window by sampling video frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundational model / captioner for particular visual details, or by fine-tuning. In hopes of helping advance progress in this area, we curate a dataset and describe automatic evaluation procedures on common RAG tasks.
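
To make the idea of an aligned caption concrete, the following is a minimal sketch, not the procedure used in this work. It assumes that each video comes with timestamped visual captions and timestamped subtitle segments, and it pairs them by temporal overlap to produce text lines that could be placed into an LLM prompt. The `Segment` type, the `align_captions` function, the output line format, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A timestamped piece of text (hypothetical structure for illustration)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str


def align_captions(visual: List[Segment], subtitles: List[Segment]) -> List[str]:
    """For each visual caption, attach the subtitle text that overlaps it in time."""
    lines = []
    for clip in visual:
        # Two intervals overlap when each one starts before the other ends.
        spoken = " ".join(
            s.text for s in subtitles
            if s.start < clip.end and s.end > clip.start
        )
        line = f"[{clip.start:.1f}s-{clip.end:.1f}s] VISUAL: {clip.text}"
        if spoken:
            line += f" | SPEECH: {spoken}"
        lines.append(line)
    return lines


# Made-up sample data, not taken from the curated dataset.
visual = [
    Segment(0.0, 5.0, "A chef chops onions."),
    Segment(5.0, 10.0, "Onions go into a pan."),
]
subtitles = [
    Segment(1.0, 4.0, "Start by dicing the onion."),
    Segment(6.0, 9.0, "Then saute until golden."),
]

aligned = align_captions(visual, subtitles)
print("\n".join(aligned))
```

Because the result is plain text, it could be embedded with a text embedding model for retrieval or inserted directly into a prompt, without sampling video frames into the context window.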
