HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

Siddhant Bansal, Michael Wray, Dima Damen

TL;DR

This work introduces HOI-Ref, a task for egocentric vision that enables referral of hands, objects, and their interactions. It provides HOI-QA, a large-scale dataset with 3.9M QA pairs derived from EPIC-Kitchens and Ego4D to train and evaluate VLMs on hand-object referral and interaction understanding. The authors propose VLM4HOI, a unified model that fuses a frozen vision encoder with an LLM via a projection layer and task-aware prompts, achieving substantial gains over baselines and demonstrating the necessity of egocentric-specific data for hand-object reasoning. The dataset, model, and code are released to advance research in HOI-Ref for egocentric vision and enable practical applications in AR/robotics.

Abstract

Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.

Paper Structure

This paper contains 42 sections, 2 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Hand-Object Interaction Referral. Given an image from an egocentric video, the goal here is to refer the hands and the objects being interacted with. For example, here we wish to refer the left and right hand along with the two objects (jar and lid) that the hands are interacting with.
  • Figure 2: (a) VLM4HOI for hand-object interaction referral in egocentric images. The VLM4HOI model takes in an image ($I$) and passes it through a vision encoder ($g$) and a projection layer ($W_\phi$) to obtain embeddings ($E_p$) in the embedding space of the language model ($f_\theta$). These are concatenated with the tokenised text ($E_L$) and passed through $f_\theta$ to generate a language response ($E_a$). We show two examples where the model generates an output based on the task instruction template. In (b), the model identifies a bounding box input as the right hand. In (c), the model takes in the image and a question to refer the object being held in the right hand, and outputs a bounding box.
  • Figure 3: Question-Answer pairs generation for training VLMs to understand hand-object interaction. We use multiple annotation types to create the question-answer pairs. Top shows the annotations utilised and Bottom shows the types of question-answer pairs generated from these annotations. As shown, we convert the segments to bounding boxes to generate various referral questions and utilise contact information to understand interaction between hands and objects. Right shows the distribution of questions in the proposed HOI-QA dataset (see the dataset subsection).
  • Figure 4: HOI-Ref task to train and evaluate VLMs for hand-object interaction referral. HOI-Ref focuses on two aspects: the ability to spatially refer and recognise hands and objects, and the capability to understand hand-object interaction. Columns (1) and (2) evaluate spatially referring hands and objects, whereas columns (3) and (4) target object and hand-side recognition. Rows (A) and (B) contrast direct referral with interaction referral. For example, in A-1, referring a bottle simply means asking where the bottle is; in B-1, it involves knowing which hand is holding the bottle.
  • Figure 5: Qualitative results for VLM4HOI and MiniGPT-v2 (Chen et al., 2023) on HOI-QA. For questions with a correct bounding box output, the ground truth bounding box is omitted. When both models are incorrect, we add the ground truth in blue. VLM4HOI performs well in most of the cases where MiniGPT-v2 falls short. VLM4HOI fails in cases of ambiguity. For example, it identifies the hand as a cloth because the hand is holding the cloth (MiniGPT-v2 predicts it as a waffle).
  • ...and 6 more figures
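
The Figure 3 caption mentions converting segments to bounding boxes and combining them with contact information to produce referral questions. The following is a minimal sketch of that idea, not the authors' pipeline: the function names, the question wording, and the dictionary layout are all hypothetical, and the mask is a plain nested list so the example needs no external libraries.

```python
def mask_to_bbox(mask):
    """Return the tight (x1, y1, x2, y2) box around truthy pixels, or None if the mask is empty.

    `mask` is a list of rows; each row is a list of 0/1 (or bool) values.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def interaction_qa(hand_side, object_name, object_mask):
    """Build one interaction-referral QA pair from a contact annotation.

    The question template below is an assumption made for illustration;
    the actual HOI-QA templates are not given in this section.
    """
    bbox = mask_to_bbox(object_mask)
    if bbox is None:
        return None  # nothing to refer to if the object has no pixels
    return {
        "question": f"Where is the object held in the {hand_side} hand?",
        "answer_bbox": bbox,
        "answer_label": object_name,
    }


# Toy 6x8 mask with the object occupying rows 2-4 and columns 3-6.
toy_mask = [[1 if 2 <= y <= 4 and 3 <= x <= 6 else 0 for x in range(8)] for y in range(6)]
qa = interaction_qa("right", "jar", toy_mask)
print(qa)
```

In this sketch the contact information is reduced to the `hand_side` argument: the annotation says which hand touches the object, and that is what turns a direct referral ("where is the jar?") into an interaction referral.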