Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, Tinoosh Mohsenin

TL;DR

The paper presents Multi-RAG, a multimodal retrieval-augmented generation system designed to support adaptive human cognition in information-dense environments. By converting video, audio, and text inputs into unified textual representations via a video encoder, a Markdown-based knowledge base, and a vector-DB RAG pipeline, the system performs semantic retrieval and LLM-driven synthesis guided by two prompt styles. Evaluations on the MMBench-Video benchmark show Multi-RAG achieving competitive or superior results compared with open-source Video-LLMs and LVLMs while using fewer resources and lower frame rates, with audio and auxiliary metadata providing notable gains. The work demonstrates a practical, scalable foundation for future human-robot adaptive assistance, highlighting design choices (sampling, chunking, embeddings, and prompting) that impact real-time multimodal reasoning and decision support.

Abstract

To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload humans' cognitive burden onto these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.

Paper Structure

This paper contains 26 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An example interaction scenario with the Multi-RAG system: Based on an open-ended user query, the system synthesizes visual, audio, and textual inputs to infer the answer, demonstrating its ability for practical understanding and decision support.
  • Figure 2: The framework of our Multi-RAG pipeline contains three key modules. In the video encoder, the VLM (which receives input from the frame sampler) and the ASR process the visual and audio streams, respectively, converting multimodal external inputs into unified textual descriptions and embeddings. Next, these are stored in the knowledge vector database, where a retrieval-augmented generation (RAG) agent selects relevant information in response to the user query. The LLM then synthesizes the retrieved content to generate the final system response.
  • Figure 3: Comparison of mainstream Video-LLMs, LVLMs, and our system on MMBench-Video. This graph illustrates performance on the Level 2 capabilities from the QA setting of the MMBench-Video benchmark. These are Coarse Perception (CP), Fine-grained Perception with Single-Instance (FP-S), Fine-grained Perception with Cross-Instance (FP-C), Hallucination (HL), Logic Reasoning (LR), Attribute Reasoning (AR), Relation Reasoning (RR), Common Sense Reasoning (CSR), and Temporal Reasoning (TR).
  • Figure 4: The Multi-RAG user interface is organized into three main columns. The left column serves as the dialogue section, displaying transcripts of ongoing conversations. The central column is dedicated to the Q&A area, where the system presents answers to user queries. The right column focuses on video content and is divided into two parts: the upper section provides rolling or overall video summaries, while the lower section displays detailed video descriptions, including selected frames and corresponding audio transcripts.
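The retrieval flow described in the Figure 2 caption (textual descriptions from the VLM and ASR stored in a knowledge base, then selected per query and handed to an LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chunk` type, the toy bag-of-words `embed` function, the in-memory list standing in for a vector database, and the prompt template are all assumptions made for illustration. A real system would likely use a learned embedding model and a dedicated vector store.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One text entry in the (hypothetical) knowledge base."""
    source: str       # "frame" for a VLM description, "audio" for an ASR transcript
    timestamp: float  # seconds into the video
    text: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: List[Chunk], k: int = 2) -> List[Chunk]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: List[Chunk]) -> str:
    """Assemble retrieved context and the user query into one LLM prompt (assumed template)."""
    context = "\n".join(
        f"[{c.source} @ {c.timestamp:.0f}s] {c.text}" for c in chunks
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge-base entries, as a video encoder might produce them.
store = [
    Chunk("frame", 0.0, "A chef chops onions on a wooden board in a kitchen."),
    Chunk("audio", 2.0, "Today we are making a tomato soup from scratch."),
    Chunk("frame", 10.0, "A pot of red soup simmers on the stove."),
]

query = "What soup is being made?"
top = retrieve(query, store, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

In this sketch both visual and audio evidence end up in the same text space, so a single similarity search can surface either modality; the resulting prompt would then be passed to whichever LLM performs the final synthesis.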