Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie

TL;DR

The paper tackles VideoQA by integrating automated Chain-of-Thoughts into Video-LLM training through Agent-of-Thoughts Distillation (AoTD). It builds an agent-based system to decompose complex questions, solves sub-tasks with specialized vision models, and automatically generates CoTs that are verified by an LLM before distilling the reasoning into a Video-LLM. Key contributions include a formal problem formulation with a CoT-aware loss, a practical pipeline for CoT construction and verification, and empirical demonstrations showing improved performance on both multiple-choice and open-ended benchmarks, along with analyses of rationales and transferability. The approach enhances interpretability and accuracy, offering a scalable way to imbue Video-LLMs with structured multi-step reasoning for complex spatial-temporal tasks.

Abstract

This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves performance on multiple-choice and open-ended benchmarks.

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Our method, AoTD, distills multi-step reasoning and spatial-temporal understanding into a single generative video-language model. When addressing complex VideoQA tasks, the model trained with AoTD (as shown in (b)) can generate step-by-step reasoning to reach the correct answer. In contrast, previous models trained solely on question-answer pairs (as in (a)) generate only a final answer, often without intermediate reasoning, which can lead to incorrect conclusions.
  • Figure 2: Overview of Agent-of-Thoughts Distillation (AoTD). Step 1: Selecting the best-performing agent for each sub-task to construct an agent-based system. Step 2: Decomposing the question into an executable program and using the chosen models to solve it sequentially, generating an execution trace. Step 3: The execution trace is converted and filtered by an LLM to produce high-quality natural-language CoTs. Step 4: Distilling the CoTs into a Video-LLM with two forms of prompt, allowing it to achieve a balance between concise answers and comprehensive rationales. The final model is Video-LLM-AoTD.
  • Figure 3: Program execution process in an agent-based system. We uniformly sample 32 frames from the video, and to ensure scale consistency, the frame ids of key frames are normalized into these 32 frames. The blue boxes represent the program execution steps, and the red boxes denote the ground truth for each step. The combination of red and yellow boxes represents one example process of evaluating object detection model candidates.
  • Figure 4:
  • Figure 5: Example from NExT-QA (Xiao et al., 2021)
  • ...and 1 more figure
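
The frame handling mentioned in the Figure 3 caption (uniformly sampling 32 frames and normalizing key-frame ids into that 32-frame range) can be illustrated with a minimal sketch. The caption does not give the exact sampling or rounding rule, so the function names, the endpoint-inclusive spacing, and the nearest-position rounding below are all assumptions made for illustration, not the authors' implementation.

```python
def uniform_sample_indices(num_frames: int, num_samples: int = 32) -> list[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1].

    Assumption: both the first and the last frame are included.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1 or num_frames == 1:
        return [0] * num_samples
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def normalize_frame_id(frame_id: int, num_frames: int, num_samples: int = 32) -> int:
    """Map an original frame id to a position in the sampled sequence.

    Assumption: the id is rescaled linearly and rounded to the nearest
    sampled position, so the result lies in [0, num_samples - 1].
    """
    if not 0 <= frame_id < num_frames:
        raise ValueError("frame_id is outside the video")
    if num_frames == 1:
        return 0
    position = frame_id * (num_samples - 1) / (num_frames - 1)
    return min(num_samples - 1, round(position))


if __name__ == "__main__":
    # Hypothetical 320-frame video with a key frame at original id 160.
    sampled = uniform_sample_indices(320)
    print(len(sampled), sampled[0], sampled[-1])
    print(normalize_frame_id(160, 320))
```

With this convention, a ground-truth key frame and a model's predicted frame are expressed on the same 0 to 31 scale regardless of video length, which is one way to obtain the scale consistency the caption refers to. For videos shorter than 32 frames, the sampled indices simply repeat.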