
GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear

Robert Konrad, Nitish Padmanaban, J. Gabriel Buckmaster, Kevin C. Boyle, Gordon Wetzstein

TL;DR

GazeGPT pairs an eye tracker with a world-facing camera so that a large multimodal model knows which object the wearer is looking at. The system crops the camera image around the user’s gaze point at multiple scales and sends these crops, together with the user's query, to GPT-4V, which can then respond with the relevant context. Across three user studies (selection speed and accuracy, AI-assisted dog-breed classification, and user preference), gaze-based selection outperforms head- and body-based modes in accuracy and speed and is consistently ranked as more natural. In the classification task, gaze-based selection is the only mode that outperforms unaided users. The work also discusses practical deployment considerations, including hardware quality and latency, and points to applications in on-the-go personal assistants.

Abstract

Multimodal large language models (LMMs) excel in world knowledge and problem-solving abilities. Through the use of a world-facing camera and contextual AI, emerging smart accessories aim to provide a seamless interface between humans and LMMs. Yet, these wearable computing systems lack an understanding of the user's attention. We introduce GazeGPT as a new user interaction paradigm for contextual AI. GazeGPT uses eye tracking to help the LMM understand which object in the world-facing camera view a user is paying attention to. Using extensive user evaluations, we show that this gaze-contingent mechanism is a faster and more accurate pointing mechanism than alternatives; that it augments human capabilities by significantly improving their accuracy in a dog-breed classification task; and that it is consistently ranked as more natural than head- or body-driven selection mechanisms for contextual AI. Moreover, we prototype a variety of application scenarios that suggest GazeGPT could be of significant value to users as part of future AI-driven personal assistants.

Paper Structure (32 sections, 7 figures)


Figures (7)

  • Figure 1: We introduce GazeGPT, a human-centric interface to generative AI models. Current AI models are exceptional at ingesting multimodal data and providing reasonable responses, but often lack the fundamental information to identify the object of interest to the human user. GazeGPT uses a combination of a gaze tracker and a world-facing camera to provide context to user queries. The query, along with a multiscale crop around the object of interest, is uploaded to a multimodal large language model, like GPT-4V, which can provide better responses with the included context. This new interface to AI has the potential to fundamentally change how humans access information.
  • Figure 2: The Zinn Labs DK1 Evaluation Kit. The major components used in the GazeGPT system (microphone, speaker, eye tracking cameras, and world-facing camera) are labeled.
  • Figure 3: An illustration of the multiscale capture concept. The narrowest field of view gives a detailed view of the car that the user is looking at, while the wider field-of-view images provide helpful context. At the same time, the total image size is reduced by an order of magnitude.
  • Figure 4: Results of selection evaluation for accuracy (left) and speed (right) for the selection target shown (top). Both the phone- and gaze-based selection modes achieved high accuracy (just under 2°), while the gaze-based selection mode was the fastest of all the modes. Significance is indicated at the **$p = 0.01$ and ***$p = 0.001$ levels. Error bars indicate SE.
  • Figure 5: Example images used for the classification evaluation (left) and the results of the evaluation (right). Each trial displayed a 9$\times$9 grid of dog images on a white background to emulate a natural environment that may have many competing objects of interest. Gaze-based selection consistently outperforms the other selection modes and is the only one to outperform the users themselves. Significance is indicated at the **$p = 0.01$ and ***$p = 0.001$ levels. Error bars indicate SE.
  • ...and 2 more figures
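
The multiscale capture concept described in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the scale fractions, the 4000×3000 frame size, and the 512-pixel output size are all hypothetical. The sketch computes one crop box per scale, centered on the gaze point and shifted to stay inside the frame, and then estimates how many fewer pixels are sent when every crop is resampled to the same small size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropBox:
    left: int
    top: int
    width: int
    height: int


def multiscale_crop_boxes(frame_w, frame_h, gaze_x, gaze_y, scales):
    """Return one crop box per scale, centered on the gaze point.

    `scales` are fractions of the frame size (hypothetical values).
    A box that would leave the frame is shifted inward, not shrunk.
    """
    boxes = []
    for s in scales:
        w = max(1, min(frame_w, round(frame_w * s)))
        h = max(1, min(frame_h, round(frame_h * s)))
        left = min(max(gaze_x - w // 2, 0), frame_w - w)
        top = min(max(gaze_y - h // 2, 0), frame_h - h)
        boxes.append(CropBox(left, top, w, h))
    return boxes


def pixel_savings(frame_w, frame_h, n_crops, out_size):
    """Full-frame pixel count divided by the total pixel count of
    n_crops crops, each resampled to out_size x out_size."""
    return (frame_w * frame_h) / (n_crops * out_size * out_size)


# Hypothetical frame and a gaze point near the top-right corner.
boxes = multiscale_crop_boxes(4000, 3000, 3900, 100, (0.1, 0.3, 1.0))
ratio = pixel_savings(4000, 3000, n_crops=len(boxes), out_size=512)
```

With these assumed numbers, three 512×512 crops total 786,432 pixels against 12,000,000 for the full frame, a reduction of roughly 15×, which is in line with the order-of-magnitude saving the caption mentions. The actual crop sizes and resolutions used by GazeGPT are not given in this summary.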