A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

TL;DR

This work introduces LongShOTBench, a diagnostic, open-ended benchmark for long-form, omni-modal video understanding that fuses vision, audio, and speech with intent-driven Q&A and rubric-based scoring. It couples the benchmark with LongShOTAgent, a training-free, modular agent that orchestrates specialized models and external tools to perform iterative, tool-augmented reasoning on hour-long videos. Through a scalable five-stage construction pipeline and a rigorous human validation process, the authors demonstrate notable gaps in current models, highlight the advantage of agentic coordination, and provide a practical foundation for advancing real-world long-video understanding.

Abstract

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. While some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, which obscures failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.

Paper Structure

This paper contains 31 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Construction pipeline of LongShOTBench. The pipeline begins with raw video data where speech, visuals, and audio cues are extracted. These are passed into multimodal processing to generate segment-wise aligned and fused metadata. Only the distilled information flows to question design, where scenarios and question types are mapped, followed by the generation of questions and conversational answers. Next, verifiable rubrics are created to evaluate correctness and difficulty. Finally, the core dataset, comprising Q&A pairs and tailored evaluation rubrics, is manually reviewed and corrected by human validators, ensuring a clean, reliable benchmark.
  • Figure 2: LongShOTAgent Pipeline. The orchestrator agent (Qwen3-4B) receives a user query and video input, then calls the Preprocessor to extract multimodal signals, including Whisper-small speech transcription, scene-based frame sampling, SigLIP embeddings, OCR, and audio analysis. These features populate a vector database, which the Search tool queries to retrieve top-k relevant segments via semantic similarity. For deeper analysis, the orchestrator invokes Refiner tools such as Whisper-large-v3 for high-quality speech transcription, Audio-Flamingo-3 for detailed audio understanding, and a Video Refiner for dense caption generation. Beyond these core modules, LongShOTAgent can access external tools including activity detection, web search, and other APIs to expand reasoning and retrieve additional context when needed. By flexibly sequencing preprocessing, search, refinement, and external tool calls, the orchestrator integrates multimodal evidence and auxiliary knowledge to generate a coherent final answer, demonstrating adaptive and agentic coordination across heterogeneous capabilities.
  • Figure 3: Distribution of video durations (minutes) in our validated sample set ($n=157$).
  • Figure 4: Video category distribution. Each video may belong to multiple categories.
  • Figure 5: Evaluation Examples of LongShOTBench. The data samples illustrate how we construct scenario context, model a user’s thought process, generate diverse questions (single- and multi-turn), and apply criterion-weighted evaluation rubrics for interpretable scoring.
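
The Search step described in the Figure 2 caption (retrieving the top-k relevant segments from a vector database via semantic similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the caption does not state which similarity measure or vector store is used, so cosine similarity, the in-memory list standing in for the database, the function names, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_segments(
    query_embedding: Sequence[float],
    index: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k segment ids most similar to the query, best first.

    `index` is a hypothetical stand-in for the vector database: a list of
    (segment_id, embedding) pairs produced by a preprocessing step.
    """
    scored = [(seg_id, cosine_similarity(query_embedding, emb)) for seg_id, emb in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: three video segments with made-up 3-dimensional embeddings.
index = [
    ("00:00-00:30", [1.0, 0.0, 0.0]),
    ("00:30-01:00", [0.0, 1.0, 0.0]),
    ("01:00-01:30", [0.7, 0.7, 0.0]),
]
query = [1.0, 0.1, 0.0]  # made-up query embedding

results = top_k_segments(query, index, k=2)
for seg_id, score in results:
    print(seg_id, round(score, 3))
```

In a pipeline like the one in Figure 2, the retrieved segment ids would then be handed to a refinement step for closer analysis; a production system would more likely use an approximate nearest-neighbor index than this linear scan.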