HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj

TL;DR

HandsOnVLM predicts future egocentric hand trajectories conditioned on natural-language prompts. It combines a SlowFast-based visual encoder, a vocabulary extended with a <HAND> token, and autoregressive decoding within a pre-trained Vision-Language Model, augmented with a CVAE-based hand trajectory decoder. It introduces two tasks with corresponding benchmarks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP), and demonstrates generalization and reasoning, including zero-shot evaluation on the unseen FPHA and H2O datasets. The work bridges high-level language-based reasoning with low-level hand action prediction, with potential impact on autonomous manipulation and human-robot interaction.

Abstract

How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to two tasks involving explicit or implicit language queries. Our proposed tasks require extensive understanding of human daily activities and the ability to reason about what should happen next given cues from the current scene. We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We enable solving these tasks by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level egocentric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and demonstrate its ability to use world knowledge effectively when reasoning about low-level human hand trajectories based on the provided context. Our website contains code and detailed video results: https://www.chenbao.tech/handsonvlm/

Paper Structure

This paper contains 24 sections, 2 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: HandsOnVLM forecasts low-level actions in the form of hand trajectories in the user's egocentric view of a scene when queried with a question via natural language.
  • Figure 2: Overview of the HandsOnVLM architecture, in which icons distinguish trainable modules from frozen ones. HandsOnVLM casts hand trajectory prediction as auto-regressive next-token prediction conditioned on fused video and language tokens. The architecture augments a pre-trained VLM with an additional hand token in the vocabulary. Separate symbols in the figure represent text tokens and <HAND> tokens.
  • Figure 3: Illustration of the annotation pipeline for the RBHP task. By using GPT-4 on human video datasets, we extract implicit language instructions for visual question-answering. The red and blue lines show trajectories for the right and left hands, respectively.
  • Figure 4: Qualitative results for different samples from the validation split of our RBHP dataset (top in blue) and zero-shot evaluations on completely unseen datasets FPHA and H2O (bottom in pink). The left-hand trajectory is visualized in blue and the right-hand trajectory is in red. The arrows denote the direction of each trajectory. GT trajectories are provided for reference.
  • Figure 5: Illustration of training pipeline.
  • ...and 5 more figures
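
The Figure 2 caption describes hand trajectory prediction as auto-regressive next-token prediction with an extra <HAND> token in the vocabulary. The toy loop below is only a sketch of that control flow, not the authors' implementation: `toy_next_token`, `toy_hand_decoder`, and `rollout` are hypothetical stand-ins, the "language head" is a fixed rule rather than a VLM, and the "trajectory head" is a random step rather than the CVAE-based decoder mentioned in the TL;DR. It assumes 2D waypoints normalized to [0, 1] for a single hand.

```python
import random

HAND, EOS = "<HAND>", "</s>"  # special tokens; EOS spelling is an assumption


def toy_next_token(tokens, horizon):
    """Hypothetical language head: emit <HAND> until `horizon` waypoints exist."""
    return HAND if tokens.count(HAND) < horizon else EOS


def toy_hand_decoder(prev_xy, rng):
    """Hypothetical trajectory head: a small step from the previous point,
    clipped to the normalized image range [0, 1]."""
    def step(v):
        return min(1.0, max(0.0, v + rng.uniform(-0.05, 0.05)))
    x, y = prev_xy
    return (step(x), step(y))


def rollout(prompt_tokens, start_xy, horizon=4, seed=0):
    """Decode token by token; each <HAND> token triggers one waypoint,
    and that waypoint conditions the next step."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    trajectory = []
    prev_xy = start_xy
    while True:
        tok = toy_next_token(tokens, horizon)
        tokens.append(tok)
        if tok == EOS:
            break
        # A <HAND> token was produced: decode a position and feed it back.
        prev_xy = toy_hand_decoder(prev_xy, rng)
        trajectory.append(prev_xy)
    return tokens, trajectory


tokens, trajectory = rollout(
    ["Where", "should", "my", "hand", "move", "?"], start_xy=(0.5, 0.5)
)
print(tokens[-5:])
print(trajectory)
```

The point of the sketch is the interleaving: text and hand waypoints share one decoding loop, so the same model can answer in words and in positions within a single response.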