HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi

TL;DR

HanDyVQA introduces a fine-grained HOI video QA benchmark that jointly evaluates manipulation and effect dynamics through MCQ and ReasoningVOS tasks across six question types. By grounding questions in Ego4D and enforcing thorough human verification, the dataset reveals substantial gaps in current video-language models’ ability to capture spatiotemporal HOI cues and part-level reasoning. The paper shows that increasing frame count, resolution, and explicit HOI cues (hand pose, object tracking, and object features) improves performance but still leaves a large gap to human accuracy, especially for motion and part-level grounding. These findings suggest that future models must incorporate richer temporal reasoning, precise local hand–object interactions, and component-level grounding to robustly model HOI dynamics in real-world videos.

Abstract

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking the fine-grained spatio-temporal reasoning needed to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows remaining challenges in spatial relationships, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

Paper Structure

This paper contains 71 sections, 2 equations, 12 figures, 32 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of HanDyVQA Dataset. HanDyVQA evaluates fine-grained hand–object interaction (HOI) dynamics for both manipulation and effect aspects through MCQ and ReasoningVOS tasks. MCQ only shows a subset of answer candidates.
  • Figure 2: Scenario distribution of HanDyVQA dataset (left), number of questions per question type (top right), and average number of words per option (bottom right).
  • Figure 3: Qualitative results. Sentences highlighted in green denote the correct answer. Colored circles denote the options predicted by each model.
  • Figure 4: Ablation and error analyses over input frames from 1 ($\approx$0.2 fps) to 64 ($\approx$12.8 fps) and resolutions.
  • Figure 5: Qualitative results on ReasoningVOS. GT means ground-truth. GT masks are shown in green in the first row, while the predictions of each model are shown in magenta. Models mistakenly segment the entire object instead of the correct part. Note that video models are provided with 16 frames during inference for temporal context.
  • ...and 7 more figures
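
The frame counts and frame rates quoted in the Figure 4 caption (1 frame at about 0.2 fps, 64 frames at about 12.8 fps) are consistent with clips of roughly 5 seconds. That clip length is an inference from the caption, not a figure stated on this page. The sketch below shows the arithmetic and one common way to pick evenly spaced frames. The function names, the 30 fps source rate, and the segment-centre sampling rule are illustrative assumptions, not the authors' pipeline.

```python
def effective_fps(num_frames: int, clip_seconds: float) -> float:
    """Sampling rate implied by drawing num_frames from a clip of clip_seconds."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return num_frames / clip_seconds


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices at the centres of equal-length segments.

    This is one common uniform-sampling rule (an assumption here); the
    benchmark's actual sampling strategy is not described on this page.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    return [min(total_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


if __name__ == "__main__":
    clip_seconds = 5.0      # assumed: inferred from 1 frame ~ 0.2 fps
    source_fps = 30         # hypothetical native frame rate of the video
    total = int(clip_seconds * source_fps)
    for n in (1, 16, 64):
        idx = uniform_frame_indices(total, n)
        print(n, "frames ->", effective_fps(n, clip_seconds), "fps; first indices:", idx[:4])
```

Under the 5-second assumption, the 16 frames mentioned in the Figure 5 caption would correspond to about 3.2 fps.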