GazeLLM: Multimodal LLMs incorporating Human Visual Attention

Jun Rekimoto

TL;DR

The paper tackles the high computational burden of Multimodal LLMs on long, high-resolution first-person video by introducing GazeLLM, which crops inputs around the user’s gaze to feed only $1/10$ of the original pixels into an MLLM. Through six real-world task categories and both human and automated evaluations on the Ego-Exo4D dataset, gaze-centered inputs yield equal or superior task descriptions compared to full inputs, while significantly reducing data and processing requirements. The work demonstrates that gaze-informed input reduction can preserve interpretability and accuracy, enabling efficient wearable-enabled AI for real-world assistance and skill transfer. This approach holds potential for scalable, real-time AI-assisted guidance in HCI and human augmentation contexts, where resource constraints and long video streams are common.

Abstract

Large Language Models (LLMs) are advancing into Multimodal LLMs (MLLMs), capable of processing image, audio, and video as well as text. Combined with first-person video, MLLMs show promising potential for understanding human activities through video and audio, enabling many human-computer interaction and human-augmentation applications such as human activity support, real-world agents, and skill transfer to robots or other individuals. However, handling high-resolution, long-duration videos generates large latent representations, leading to substantial memory and processing demands that limit the length and resolution MLLMs can manage. Reducing video resolution can lower memory usage but often compromises comprehension. This paper introduces a method that optimizes first-person video analysis by integrating eye-tracking data: it decomposes first-person video into sub-areas around the regions of gaze focus. By processing these selectively gaze-focused inputs, our approach achieves task comprehension equivalent to or even better than processing the entire image at full resolution, but with significantly reduced video data input (the number of pixels is reduced to one-tenth), offering an efficient solution for using MLLMs to interpret and utilize human skills.
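To make the one-tenth pixel budget concrete, the following is a minimal Python sketch of how a gaze-centered crop rectangle might be computed for a single frame. It is an illustration only, not the paper's implementation: the function name `gaze_crop_box`, the choice to preserve the frame's aspect ratio, and the clamping of the crop to the frame edges are all assumptions introduced here.

```python
import math


def gaze_crop_box(frame_w, frame_h, gaze_x, gaze_y, pixel_fraction=0.1):
    """Return (left, top, right, bottom) of a crop centered on the gaze point.

    Hypothetical helper: keeping the frame's aspect ratio and clamping the
    crop to the frame borders are assumptions of this sketch, not details
    taken from the paper.
    """
    if not 0 < pixel_fraction <= 1:
        raise ValueError("pixel_fraction must be in (0, 1]")

    # Area scales with the square of the side length, so each side is
    # multiplied by sqrt(pixel_fraction) to keep that fraction of the pixels.
    scale = math.sqrt(pixel_fraction)
    crop_w = max(1, round(frame_w * scale))
    crop_h = max(1, round(frame_h * scale))

    # Center the window on the gaze point, then shift it back inside the
    # frame if the gaze is close to a border.
    left = min(max(round(gaze_x - crop_w / 2), 0), frame_w - crop_w)
    top = min(max(round(gaze_y - crop_h / 2), 0), frame_h - crop_h)
    return left, top, left + crop_w, top + crop_h
```

For a hypothetical 1920 × 1080 frame, this yields a 607 × 342 window, which is about 10% of the original pixel count. Shifting the window at the borders (rather than shrinking it) keeps the input size constant across frames, which is one plausible design choice when the crops are fed to a model as a video.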

Paper Structure

This paper contains 10 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Video task description using GazeLLM: Only the area cropped as a partial rectangle centered on the gaze point in the 1st-person video (indicated by the red dashed outline) is used as input to the MLLM for task verbalization. The output from the LLM includes the task's description and the corresponding video timestamp. The generated description is displayed alongside the video footage at the indicated time.
  • Figure 2: Example of the "stop-and-ask" application: A user wearing a 1st-person vision headset pauses during a task and asks the system, "What should I do next?" Based on the 1st-person vision video recorded up to that point, the system identifies the next required step and provides guidance. It also references the pre-recorded instructional video and indicates the corresponding playback timestamp.
  • Figure 3: Dataflow generation from cooking demonstration.
  • Figure 4: Evaluation tasks, labeled 'Bike', 'Sushi', 'Omelette', 'Soccer', 'PCR' (polymerase chain reaction test), and 'CPR' (cardiopulmonary resuscitation). Red circles indicating viewpoints are attached for reference.
  • Figure 5: Video types for evaluation (Full: Full field of view, Gaze: Video cropped around the gaze point, Center: Video cropped from the center of the image)
  • ...and 5 more figures