Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Mingyang Mao; Mariela M. Perez-Cabarcas; Utteja Kallakuri; Nicholas R. Waytowich; Xiaomin Lin; Tinoosh Mohsenin

TL;DR

The paper presents Multi-RAG, a multimodal retrieval-augmented generation system designed to support adaptive human cognition in information-dense environments. By converting video, audio, and text inputs into unified textual representations via a video encoder, a Markdown-based knowledge base, and a vector-DB RAG pipeline, the system performs semantic retrieval and LLM-driven synthesis guided by two prompt styles. Evaluations on the MMBench-Video benchmark show Multi-RAG achieving competitive or superior results compared with open-source Video-LLMs and LVLMs while using fewer resources and lower frame rates, with audio and auxiliary metadata providing notable gains. The work demonstrates a practical, scalable foundation for future human-robot adaptive assistance, highlighting design choices (sampling, chunking, embeddings, and prompting) that impact real-time multimodal reasoning and decision support.
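The text-centric retrieval flow summarized above (textual representations collected in a Markdown knowledge base, chunked, embedded, and retrieved by semantic similarity before LLM synthesis) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`chunk_markdown`, `embed`, `retrieve`, `build_prompt`), the chunk size, and the prompt wording are all hypothetical. A toy bag-of-words vector stands in for a learned embedding model and a vector database, and the paper's two prompt styles are not reproduced here.

```python
import math
import re
from collections import Counter


def chunk_markdown(text: str, max_words: int = 40) -> list[str]:
    """Split Markdown text at blank lines, then cap each chunk at max_words words."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (linear scan, no vector DB)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble a generic RAG prompt; the wording is illustrative only."""
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
```

In a real deployment the `embed` and `retrieve` steps would typically be replaced by a learned embedding model and a vector-database query, and the string from `build_prompt` would be passed to an LLM for synthesis.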

Abstract

To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and a need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
