HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek

Abstract

Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction recognition (+2.63%).

Paper Structure

This paper contains 42 sections, 17 equations, 24 figures, and 14 tables.

Figures (24)

  • Figure 1: Overview of HandVQA’s transfer effect. Fine-tuning a base Vision-Language Model (VLM) on HandVQA teaches it explicit 3D hand geometry and joint-level spatial reasoning. The resulting Spatial-Aware VLM exhibits zero-shot generalization to novel downstream tasks: both image-based gesture recognition and video-based hand-object interaction recognition. Spatial-Aware VLM achieves consistent accuracy gains without task-specific training.
  • Figure 2: Overview of HandVQA Question Format. This figure illustrates the structure of our benchmark, which divides hand pose estimation into five sub-tasks: Angle, Distance, and Relative Position along the X, Y, and Z axes. A hand image with annotated joint indices (top left) supports multiple-choice questions for each task type, derived from 3D joint coordinates; the correct answers are shown in green.
  • Figure 3: Overview of the HandVQA pipeline. The pipeline converts normalized 3D hand joints into interpretable VQA pairs through three deterministic stages: (1) $\mathcal{F}_{\text{pose}}$ computes continuous pose descriptors, namely angles ($\theta$), distances ($d$), and relative positions ($\Delta_x, \Delta_y, \Delta_z$), and categorizes them into discrete pose descriptors ($\Gamma$); (2) $\mathcal{F}_{\text{text}}$ fills deterministic sentence templates using $\Gamma$ and filters correct and incorrect options to form the candidate answer set ($\mathcal{O}$) and the correct label $y^\star$; (3) $\mathcal{F}_{\text{mcq}}$ constructs multiple-choice questions (MCQs) by pairing each image with $\mathcal{O}$ and its correct label $y^\star$. (A minimal code sketch of this pipeline follows the figure list.)
  • Figure 4: The map of the hand skeleton used in our HandVQA benchmark generation pipeline.
  • Figure 5: Possible location of the 'aligned' Little Finger Proximal Interphalangeal (PIP) joint and Ring Finger PIP joint underneath the index and middle fingers. The relationship along the x-axis for the two PIP joints is ambiguous, making it necessary to drop the relative position X information for the two joints.
  • ...and 19 more figures
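
To make the pipeline of Figure 3 concrete, below is a minimal sketch of its three deterministic stages, generating one multiple-choice question from normalized 3D joints. The Python function names, bin edges, template wording, and answer options are illustrative assumptions rather than the paper's exact values; only the descriptor types (angles, distances, per-axis offsets) and the three-stage structure follow the description above.

```python
import numpy as np

# Minimal sketch of the three-stage HandVQA generation pipeline (Figure 3).
# Bin edges, template wording, and option lists below are illustrative
# assumptions, not the paper's exact values.

def pose_descriptors(joints, i, j, k):
    """F_pose: continuous descriptors from normalized 3D joints (21 x 3).
    Returns the angle at joint j formed by i-j-k (degrees), the i-j distance,
    and the per-axis offsets from joint i to joint j."""
    v1, v2 = joints[i] - joints[j], joints[k] - joints[j]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    dist = float(np.linalg.norm(joints[j] - joints[i]))
    dx, dy, dz = joints[j] - joints[i]
    return angle, dist, (dx, dy, dz)

def categorize(angle, dist, offsets, eps=0.01):
    """Discretize continuous descriptors into categorical labels (Gamma)."""
    gamma = {
        "angle": "bent" if angle < 120 else "straight",   # assumed bin edge
        "distance": "close" if dist < 0.05 else "far",    # assumed bin edge
    }
    for axis, d in zip("xyz", offsets):
        # Near-zero offsets are ambiguous and dropped (cf. Figure 5).
        gamma[axis] = None if abs(d) < eps else ("positive" if d > 0 else "negative")
    return gamma

def build_mcq(gamma, j):
    """F_text + F_mcq: fill a sentence template and pair it with the option
    set O and the correct label y*."""
    question = f"Is the finger bent or straight at joint {j}?"  # assumed template
    options = ["bent", "straight"]
    return {"question": question, "options": options, "answer": gamma["angle"]}

if __name__ == "__main__":
    joints = np.random.rand(21, 3)        # stand-in for real normalized annotations
    angle, dist, offsets = pose_descriptors(joints, 5, 6, 7)  # e.g., index MCP-PIP-DIP
    print(build_mcq(categorize(angle, dist, offsets), 6))
```

Running the example prints a dictionary holding the question string, the candidate option set $\mathcal{O}$, and the correct label $y^\star$, mirroring the image-question-options triplets that make up the benchmark.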