GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear
Robert Konrad, Nitish Padmanaban, J. Gabriel Buckmaster, Kevin C. Boyle, Gordon Wetzstein
TL;DR
GazeGPT advances wearable AI by pairing an eye tracker with a world-facing camera so that a large multimodal model receives context about what the user is actually looking at. The approach crops the camera image around the user’s gaze point at multiple scales and feeds these crops to GPT-4V, improving object understanding and task performance. Across a selection speed and accuracy study, a dog-breed classification study, and a user-preference study, gaze-based selection is faster and more accurate than head- and body-based selection and is consistently rated as more natural. The work shows that gaze-contingent assistance can raise human performance on a complex visual task to near the level of the AI itself, and it discusses practical deployment considerations, including hardware quality and latency, with a view to on-the-go personal assistants.
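The multi-scale cropping step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaze_crop_boxes`, the square crop shape, the default crop sizes, and the clamping behavior at the image border are all assumptions made for illustration. It only computes the crop rectangles around a gaze point; extracting the pixels and sending them to a model is left out.

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def gaze_crop_boxes(
    gaze_xy: Tuple[float, float],
    image_size: Tuple[int, int],
    crop_sizes: Sequence[int] = (256, 512, 1024),  # assumed sizes, not from the paper
) -> List[Box]:
    """Return one square crop box per scale, centered on the gaze point.

    Boxes are shifted to stay inside the image, and shrunk to the image
    dimensions if a requested size exceeds them.
    """
    gx, gy = gaze_xy
    width, height = image_size
    boxes: List[Box] = []
    for size in crop_sizes:
        # Never request more pixels than the image has.
        w = min(size, width)
        h = min(size, height)
        # Center on the gaze point, then clamp so the box stays in bounds.
        left = min(max(int(round(gx - w / 2)), 0), width - w)
        top = min(max(int(round(gy - h / 2)), 0), height - h)
        boxes.append((left, top, left + w, top + h))
    return boxes


if __name__ == "__main__":
    # Gaze at the center of a hypothetical 1920x1080 world-camera frame.
    for box in gaze_crop_boxes((960, 540), (1920, 1080)):
        print(box)
```

The smallest box isolates the attended object, while the larger boxes keep the surrounding scene, which is one plausible way to give a model both detail and context.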
Abstract
Multimodal large language models (LMMs) excel in world knowledge and problem-solving abilities. Through the use of a world-facing camera and contextual AI, emerging smart accessories aim to provide a seamless interface between humans and LMMs. Yet, these wearable computing systems lack an understanding of the user's attention. We introduce GazeGPT as a new user interaction paradigm for contextual AI. GazeGPT uses eye tracking to help the LMM understand which object in the world-facing camera view a user is paying attention to. Using extensive user evaluations, we show that this gaze-contingent mechanism is a faster and more accurate pointing mechanism than alternatives; that it augments human capabilities by significantly improving users' accuracy in a dog-breed classification task; and that it is consistently ranked as more natural than head- or body-driven selection mechanisms for contextual AI. Moreover, we prototype a variety of application scenarios that suggest GazeGPT could be of significant value to users as part of future AI-driven personal assistants.
