AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar

TL;DR

AVATAAR tackles long-form video QA by combining a persistent global video summary with local, query-driven evidence retrieval and an iterative think–retrieve–rethink loop. The architecture introduces a Pre Retrieval Thinking Agent for dynamic query refinement and a Rethink Module that diagnoses gaps and steers subsequent retrieval, forming an agentic retrieval-augmented generation (RAG) workflow. On the CinePile benchmark, AVATAAR achieves relative gains over a baseline in temporal reasoning ($+5.6\%$), technical queries ($+5.0\%$), theme-based questions ($+8.0\%$), and narrative comprehension ($+8.2\%$), while ablations show that each module contributes positively and that the feedback loop is critical. The work presents a scalable, interpretable framework that blends global memory with adaptive retrieval for long-form video understanding, suitable for enterprise deployments and future LVLM-driven multimodal research.

Abstract

With the increasing prevalence of video content, effectively understanding and answering questions about long-form videos has become essential for numerous applications. Although large vision language models (LVLMs) have improved performance, they often struggle with nuanced queries that demand both comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5.0% in technical queries, +8.0% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
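
The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`answer_with_feedback`, `think`, `retrieve`, `answer`, `rethink`), the `Verdict` type, the `max_rounds` cap, and the toy components are all hypothetical stand-ins. In a real system each callable would presumably wrap an LVLM call or a clip retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical output of a rethink step: is the draft good enough, and if not, what is missing."""
    sufficient: bool
    gap: str = ""


def answer_with_feedback(
    question: str,
    global_summary: str,
    think: Callable[[str, str, str], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, str, List[str]], str],
    rethink: Callable[[str, str, List[str]], Verdict],
    max_rounds: int = 3,  # assumed cap; the abstract does not state one
) -> Tuple[str, int]:
    """Run think -> retrieve -> answer -> rethink until the draft is judged sufficient."""
    evidence: List[str] = []
    gap, draft, rounds = "", "", 0
    for rounds in range(1, max_rounds + 1):
        # Pre-retrieval thinking: refine the query using the global summary and the last diagnosed gap.
        query = think(question, global_summary, gap)
        # Local, query-driven retrieval; evidence accumulates across rounds.
        evidence.extend(retrieve(query))
        draft = answer(question, global_summary, evidence)
        # Rethink: diagnose what the partial answer still lacks.
        verdict = rethink(question, draft, evidence)
        if verdict.sufficient:
            break
        gap = verdict.gap
    return draft, rounds


# Toy components (purely illustrative, no model calls).
TOY_INDEX = {
    "who finds the key?": ["Anna hides the key."],
    "key after hiding": ["Ben finds the key."],
}


def toy_think(question: str, summary: str, gap: str) -> str:
    return gap or question  # first round: use the raw question


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


def toy_answer(question: str, summary: str, evidence: List[str]) -> str:
    return next((e for e in evidence if "finds" in e), "unknown")


def toy_rethink(question: str, draft: str, evidence: List[str]) -> Verdict:
    if draft == "unknown":
        return Verdict(False, gap="key after hiding")
    return Verdict(True)


if __name__ == "__main__":
    print(answer_with_feedback("who finds the key?", "A heist film.",
                               toy_think, toy_retrieve, toy_answer, toy_rethink))
```

In this toy run the first retrieval returns only the hiding scene, the rethink step flags the gap, and the second round retrieves the clip that answers the question. Capping the number of rounds is a design choice of the sketch to keep the loop bounded.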

Paper Structure

This paper contains 19 sections, 15 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Traditional video question answering pipeline
  • Figure 2: AVATAAR: Think, Retrieve, Rethink - Agentic Video QA Framework
  • Figure 3: F1 Score by System Variant and Category