HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal, Michael Wray, Dima Damen
TL;DR
This work introduces HOI-Ref, an egocentric vision task in which a VLM refers to hands, objects, and the interactions between them. To support it, the authors curate HOI-QA, a dataset of 3.9M question-answer pairs derived from EPIC-Kitchens and Ego4D for training and evaluating VLMs on hand-object referral and interaction understanding. They propose VLM4HOI, a unified model that connects a frozen vision encoder to an LLM through a projection layer and task-aware prompts. Fine-tuning on HOI-QA improves performance by 27.9% for referring to hands and objects and by 26.7% for referring to interactions, showing that egocentric-specific data is necessary for hand-object reasoning. The dataset, model, and code are released to advance research on HOI-Ref and to support applications in AR and robotics.
Abstract
Large Vision Language Models (VLMs) are now the de facto state of the art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images, which aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset, which consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and, critically, their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third-person images fail to recognise and refer to hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring to hands and objects, and by 26.7% for referring to interactions.
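
To make the three question types concrete (hands, objects, and their interactions), the following is a minimal sketch of how one annotated egocentric frame might be turned into referral question-answer pairs. The field names, the question templates, and the box format (integers in [0, 100] relative to the image size) are illustrative assumptions and are not taken from the HOI-QA schema.

```python
from dataclasses import dataclass


# Hypothetical annotation type; the real HOI-QA format may differ.
@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def normalised(self, width: int, height: int, bins: int = 100) -> str:
        """Format the box as integers in [0, bins] (an assumed convention)."""
        vals = (
            round(self.x1 / width * bins),
            round(self.y1 / height * bins),
            round(self.x2 / width * bins),
            round(self.y2 / height * bins),
        )
        return "[{}, {}, {}, {}]".format(*vals)


def make_qa_pairs(width, height, hand_side, hand_box, object_name, object_box):
    """Build one QA pair per question type: hand, object, interaction."""
    hand = hand_box.normalised(width, height)
    obj = object_box.normalised(width, height)
    return [
        # Hand referral: locate a given hand.
        {"type": "hand",
         "question": f"Where is the {hand_side} hand?",
         "answer": hand},
        # Object referral: locate a named object.
        {"type": "object",
         "question": f"Where is the {object_name}?",
         "answer": obj},
        # Interaction referral: name and locate the manipulated object.
        {"type": "interaction",
         "question": f"Which object is the {hand_side} hand manipulating?",
         "answer": f"{object_name} {obj}"},
    ]


# Made-up example frame: a 1920x1080 image with a right hand holding a knife.
pairs = make_qa_pairs(
    1920, 1080,
    "right", Box(960, 540, 1440, 810),
    "knife", Box(1152, 486, 1536, 648),
)
for pair in pairs:
    print(pair["type"], "|", pair["question"], "->", pair["answer"])
```

The interaction question is the only one whose answer depends on both the hand and the object annotation, which is why it cannot be answered by a model that only localises each entity independently.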
