The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Hailiang Zhang, Dian Chao, Zhihao Guan, Yang Yang

TL;DR

Grounded video QA requires both answering questions about a video and localizing the corresponding target object across frames. The authors propose a two-stage pipeline that first uses the VALOR model to answer questions from video content, then concatenates the question with its answer and uses TubeDETR to produce bounding boxes, with frame sampling to handle long videos. Key contributions include showing that VALOR+TubeDETR can outperform the official baseline, employing a frame-rate sampling strategy for long sequences, and applying EMA-based training stabilization with careful pretraining initialization. This approach offers a scalable, practical framework for grounded video QA in untrimmed video settings. Overall, the method demonstrates that QA-guided grounding can effectively locate targets across time, enabling robust video understanding in challenging benchmarks.
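The summary above mentions two training-side techniques without giving details: sampling frames from long videos and stabilizing training with an exponential moving average (EMA) of the weights. The sketch below illustrates the general idea of each in plain Python. The function names, the fixed-stride sampling rule, and the decay value are assumptions made for illustration; the source does not state the actual sampling rate or EMA settings used.

```python
from typing import Dict, List


def sample_frame_indices(num_frames: int, stride: int) -> List[int]:
    """Pick every `stride`-th frame index so a long video fits in memory.

    Fixed-stride sampling is an assumption; the source only says that
    frames are sampled to handle long videos.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(0, num_frames, stride))


def ema_update(
    ema_params: Dict[str, float],
    current_params: Dict[str, float],
    decay: float,
) -> Dict[str, float]:
    """Return new EMA weights: decay * old_ema + (1 - decay) * current.

    Parameters are plain floats here to keep the sketch dependency-free;
    a real model would apply the same rule to each weight tensor.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * current_params[name]
        for name in ema_params
    }


if __name__ == "__main__":
    # A 10-frame clip sampled with a hypothetical stride of 4.
    print(sample_frame_indices(10, 4))
    # One EMA step with a hypothetical decay of 0.5.
    print(ema_update({"w": 1.0}, {"w": 0.0}, 0.5))
```

A decay close to 1 makes the averaged weights change slowly, which is the usual reason EMA smooths out noisy training updates.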

Abstract

In this paper, we introduce a solution for grounded video question answering. Our analysis shows that the official baseline follows a fixed two-step procedure: visual grounding followed by object tracking. A significant problem arises in the first step, because the selected frames may not contain a clearly identifiable target object. Furthermore, a single image cannot resolve questions that depend on temporal context, such as "Track the container from which the person pours the first time." To address this, we propose an alternative two-stage approach: (1) we use the VALOR model to answer the question from the video content; (2) we concatenate each question with its predicted answer and use TubeDETR to generate bounding boxes for the target.
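The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `answer_fn` and `ground_fn` are hypothetical stand-ins for VALOR and TubeDETR inference, and joining the question and answer with a single space is an assumption, since the source does not specify the exact concatenation format.

```python
from typing import Callable, List, Sequence, Tuple

# One box per frame as (x1, y1, x2, y2); the format is an assumption.
Box = Tuple[float, float, float, float]


def build_grounding_query(question: str, answer: str) -> str:
    """Concatenate a question with its predicted answer into one text query."""
    return f"{question.strip()} {answer.strip()}"


def grounded_video_qa(
    frames: Sequence[object],
    question: str,
    answer_fn: Callable[[Sequence[object], str], str],
    ground_fn: Callable[[Sequence[object], str], List[Box]],
) -> Tuple[str, List[Box]]:
    """Stage 1: answer the question. Stage 2: ground the question+answer query."""
    answer = answer_fn(frames, question)              # stand-in for VALOR
    query = build_grounding_query(question, answer)
    boxes = ground_fn(frames, query)                  # stand-in for TubeDETR
    return answer, boxes


if __name__ == "__main__":
    # Dummy models so the sketch runs without any pretrained weights.
    def fake_answer(frames: Sequence[object], question: str) -> str:
        return "the cup"

    def fake_ground(frames: Sequence[object], query: str) -> List[Box]:
        return [(0.0, 0.0, 1.0, 1.0) for _ in frames]

    question = "Track the container from which the person pours the first time."
    print(grounded_video_qa(["f0", "f1"], question, fake_answer, fake_ground))
```

Passing the models in as callables keeps the pipeline logic separate from any particular checkpoint or library, so either stage can be swapped out independently.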

Paper Structure (10 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: Grounded videoQA needs to answer the question and locate the answer.
  • Figure 2: The Overall Architecture of Our Method Based on VALOR and TubeDETR.
  • Figure 3: TubeDETR model overview.