FingerCap: Fine-grained Finger-level Hand Motion Captioning

Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang, Tianqing Zhu, Shuchen Wu, Chenxi Miao, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang, Xin Yu

TL;DR

FingerCap tackles the challenge of describing fine-grained finger-level hand motions by introducing FingerCap-40K, a large dataset spanning gesture and hand-object interactions, and a benchmark evaluation via HandJudge. The authors propose FiGOP, a compute-efficient unit that pairs sparse RGB keyframes with dense hand-keypoint sequences and fuses them through a motion-aware projector to enable accurate finger-level reasoning in Video-MLLMs. Through extensive benchmarks against open- and closed-source models, FiGOP consistently improves finger articulation accuracy and motion completeness, even under distribution shifts, as measured by HandJudge and human studies. This work highlights a fundamental gap in current video-language models' ability to capture high-frequency finger dynamics and provides a practical path to bridging perception and language for dexterous manipulation and sign-language understanding.

Abstract

Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
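The FiGOP grouping described in the abstract can be illustrated with a minimal sketch. It only shows how frame indices might be partitioned into units of one RGB keyframe plus the keypoint frames that follow it; it is not the authors' implementation. The names `FiGOPUnit`, `build_figop_units`, and `keyframe_stride` are hypothetical, a fixed keyframe stride is assumed, and the keyframe's own keypoints are assumed to be excluded from its unit, since the abstract does not specify either detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FiGOPUnit:
    """One hypothetical FiGOP unit: a sparse RGB keyframe plus dense keypoint frames."""
    keyframe_index: int          # frame index of the RGB keyframe
    keypoint_indices: List[int]  # frames whose hand keypoints follow the keyframe


def build_figop_units(num_frames: int, keyframe_stride: int) -> List[FiGOPUnit]:
    """Partition frame indices into units, assuming a fixed keyframe stride.

    Each unit holds one keyframe and the indices of all subsequent frames
    up to (but not including) the next keyframe.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if keyframe_stride < 1:
        raise ValueError("keyframe_stride must be at least 1")

    units: List[FiGOPUnit] = []
    for start in range(0, num_frames, keyframe_stride):
        # The last unit may be shorter when the video length is not a multiple of the stride.
        end = min(start + keyframe_stride, num_frames)
        units.append(FiGOPUnit(start, list(range(start + 1, end))))
    return units


if __name__ == "__main__":
    for unit in build_figop_units(num_frames=10, keyframe_stride=4):
        print(unit.keyframe_index, unit.keypoint_indices)
```

Under these assumptions, a 10-frame clip with a stride of 4 uses only three RGB keyframes (frames 0, 4, and 8), while keypoints from the remaining seven frames carry the motion between them. This is the sense in which temporal detail is recovered without increasing RGB density.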

Paper Structure

This paper contains 42 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: FingerCap aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. Examples from FingerCap-40K: top, concise instruction-style clips with explicit targets for finger articulation; bottom, hand–object interactions showing coordinated finger dynamics during manipulation.
  • Figure 2: Data collection, annotation and processing pipeline for gesture and hand–object interaction data in FingerCap-40K. Gesture videos are collected from multilingual sign language datasets, where raw dictionary-style motion descriptions are manually corrected and refined using an LLM to produce finger-level captions. Hand–object interaction videos are sampled from multi-view manipulation datasets, in which the clearest view is selected, followed by human-written and LLM-refined finger–object interaction descriptions.
  • Figure 3: Statistics of the FingerCap-40K dataset across gesture and hand–object interaction domains, summarizing video and text scale, frame density, vocabulary coverage, camera diversity, hand-use distribution, and OOD subset information.
  • Figure 4: Data distribution in FingerCap-40K. Top: the word cloud of finger- and hand-related terms in captions. Bottom: (left) video duration; (middle) single vs. double hand usage across viewpoints; and (right) caption length distribution.
  • Figure 5: Overview of the FiGOP-augmented Video-MLLM.
  • ...and 6 more figures