
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering

Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig

TL;DR

This work introduces TraveLER, a modular multi-LMM agent framework for VideoQA that orchestrates Traverse, Locate, Evaluate, and Replan stages under a Planner's guidance. By employing a memory bank and iterative cross-agent feedback, the system revisits video frames to extract targeted information through a question-driven Extractor while the Evaluator decides sufficiency and triggers replanning when needed. Across multiple zero-shot benchmarks (NExT-QA, EgoSchema, STAR, Perception Test), TraveLER demonstrates improved performance without dataset-specific fine-tuning, highlighting the value of adaptive, iterative planning in video understanding. The approach is versatile, compatible with various LLM/LMM backbones, and supported by extensive ablations that confirm the contributions of planning, retrieval, extraction, evaluation, and memory management.

Abstract

Recently, image-based Large Multimodal Models (LMMs) have made significant progress in video question-answering (VideoQA) using a frame-wise approach by leveraging large-scale pretraining in a zero-shot manner. Nevertheless, these models need to be capable of finding relevant information, extracting it, and answering the question simultaneously. Currently, existing methods perform all of these steps in a single pass without being able to adapt if insufficient or incorrect information is collected. To overcome this, we introduce a modular multi-LMM agent framework based on several agents with different roles, instructed by a Planner agent that updates its instructions using shared feedback from the other agents. Specifically, we propose TraveLER, a method that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "Replan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the need to fine-tune on specific datasets. Our code is available at https://github.com/traveler-framework/TraveLER.

Paper Structure (27 sections, 3 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: A simplified overview of our TraveLER framework. Our proposed framework aims to answer the question by collecting relevant information from keyframes through interactive question-asking. To accomplish this, several agents (in colored boxes) with different roles interact (left-to-right in each row) over several iterations. TraveLER creates a plan (in blue) to "traverse" (in orange) through the video, asks questions regarding individual frames (in yellow) to "locate" and store key information, "evaluates" whether there is sufficient information to answer the question (in green), and "replans" using past collected knowledge if there is not enough information.
  • Figure 2: TraveLER framework. Our framework consists of four different modules, the Planner, Retriever, Extractor, and Evaluator. The Planner creates a plan and sends it to the Retriever. The Retriever uses the plan to select the next timestamp and sends this to the Extractor. The Extractor captions and generates questions about the timestamp, answers the questions, and saves the output in the memory bank. Finally, the Evaluator determines if there is enough information and if the plan has been followed. If yes, the Evaluator returns the answer, else the existing information is sent back to the Planner to begin a new iteration.
  • Figure 3: Comparison of different Memory Initialization (1, 3, 5 frames). 5 frames is optimal.
  • Figure 4: Comparison of different LMM Response Length (75, 150, 300 max tokens). 150 is optimal.
  • Figure 5: NExT-QA Success Predictions. We can see that our framework can adapt to new information collected in past iterations. For example, in Iteration 3, our Planner module is able to use information about the wooden post from a previous iteration and ask further questions to identify the correct answer.
  • ...and 7 more figures
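
The control flow described in the Figure 2 caption (Planner, Retriever, Extractor, Evaluator, with a shared memory bank) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `traveler_loop`, the `MemoryBank` class, the `max_iterations` cap, and all `toy_*` agents are hypothetical names introduced here, and the real agents are LLM/LMM calls rather than the string-matching stubs used below. Returning `None` when the iteration budget runs out is likewise an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MemoryBank:
    """Hypothetical memory bank: timestamp -> notes collected at that frame."""
    notes: Dict[int, List[str]] = field(default_factory=dict)

    def add(self, timestamp: int, entries: List[str]) -> None:
        self.notes.setdefault(timestamp, []).extend(entries)


def traveler_loop(
    question: str,
    num_frames: int,
    planner: Callable[[str, MemoryBank], str],
    retriever: Callable[[str, MemoryBank, int], int],
    extractor: Callable[[int, str, str], List[str]],
    evaluator: Callable[[str, str, MemoryBank], Optional[str]],
    max_iterations: int = 5,  # assumed cap, not taken from the source
) -> Tuple[Optional[str], int]:
    """Run plan -> traverse -> locate -> evaluate, replanning until an answer."""
    memory = MemoryBank()
    for iteration in range(1, max_iterations + 1):
        plan = planner(question, memory)                     # plan / replan
        timestamp = retriever(plan, memory, num_frames)      # traverse
        memory.add(timestamp, extractor(timestamp, plan, question))  # locate
        answer = evaluator(question, plan, memory)           # evaluate
        if answer is not None:
            return answer, iteration
    return None, max_iterations


# Toy stand-ins for the agents: each "frame" is just a text description.
FRAMES = [
    "a boy stands on grass",
    "the boy picks up a ball",
    "the boy throws the ball",
    "a dog runs after the ball",
]


def toy_planner(question: str, memory: MemoryBank) -> str:
    return f"inspect a frame not in {sorted(memory.notes)}"


def toy_retriever(plan: str, memory: MemoryBank, num_frames: int) -> int:
    # Pick the earliest frame that has not been visited yet.
    for t in range(num_frames):
        if t not in memory.notes:
            return t
    return num_frames - 1


def toy_extractor(timestamp: int, plan: str, question: str) -> List[str]:
    return [f"caption: {FRAMES[timestamp]}"]


def toy_evaluator(question: str, plan: str, memory: MemoryBank) -> Optional[str]:
    # Declare the information sufficient once any note mentions a throw.
    for entries in memory.notes.values():
        if any("throws" in entry for entry in entries):
            return "throws it"
    return None


if __name__ == "__main__":
    print(traveler_loop("What does the boy do with the ball?", len(FRAMES),
                        toy_planner, toy_retriever, toy_extractor, toy_evaluator))
```

Passing the agents in as plain callables keeps the loop independent of any particular backbone, which mirrors the source's claim that the framework is compatible with various LLM/LMM backbones; swapping a stub for a model-backed function would not change `traveler_loop` itself.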