Table of Contents
Fetching ...

GazeGen: Gaze-Driven User Interaction for Visual Content Generation

He-Yen Hsieh, Ziyun Li, Sai Qian Zhang, Wei-Te Mark Ting, Kao-Den Chang, Barbara De Salvo, Chiao Liu, H. T. Kung

TL;DR

GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze, an ultra-lightweight model with only 281K parameters performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices.

Abstract

We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface style changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.

GazeGen: Gaze-Driven User Interaction for Visual Content Generation

TL;DR

GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze, an ultra-lightweight model with only 281K parameters performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices.

Abstract

We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface style changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.

Paper Structure

This paper contains 18 sections, 5 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: GazeGen. (1) User's View: Overview of the user's view, setting the context for gaze estimation (input: user's eye images) and visual editing (inputs: user's view and predicted gaze point). (2) Real-Time Gaze Estimation: The DFT Gaze Agent (281KB storage) predicts the user's gaze point (green) aligned with the ground-truth gaze (red). (3) Gaze-Driven Visual Content Generation/Detection: Predicted gaze is used for editing () objects, detecting () objects, or creating animations () based on the user's focus (). GazeGen sets a new standard for gaze-driven visual content generation, enhancing user experience and positioning users as visual creators.
  • Figure 2: Extended applications of gaze-driven interaction with GazeGen. (1) Real-Time Gaze Estimation: Continuous tracking of eye movements for precise gaze estimation. (2) Gaze-Driven Detection: Detecting and identifying objects based on where the user is looking. (3) Gaze-Driven Image Editing: Dynamic editing tasks such as Addition (adding objects based on the user's gaze), Deletion/Replacement (removing or replacing objects based on the user's gaze), Reposition (move objects by first gazing at the initial position, then the new position), and Style Transfer (change an object's style or texture by first gazing at a reference object, then applying the style to the target object). (4) Gaze-Driven Video Generation: Creating and manipulating video content driven by the user's gaze.
  • Figure 3: Gaze-driven visual content generation. This diagram shows the process starting from the user's eye, where the gaze estimation agent determines the gaze point. The gaze point is used to get the editing region, which can be toggled to use either a box or a mask. The T2I (Text-to-Image) and T2V (Text-to-Video) modules then generate visual content based on the selected editing region. The On/Off switches indicate whether the box or mask is used for gaze-driven editing.
  • Figure 4: Self-supervised distillation for a compact model. Using ConvNeXt V2-A WooDHC0KX23 as the teacher network, we create a downsized student network. The first stage of the student model inherits weights from the teacher, while stages 2 to 4 reduce the channel dimensions to one-fourth. Distinct decoders are used to reconstruct both input images and the teacher’s intermediate features. The student processes masked inputs, allowing it to emulate the teacher's deep understanding of visual data and align with how the teacher perceives and interprets these images. For simplicity, the diagram only illustrates the reconstruction of the teacher’s features to emulate knowledge.
  • Figure 5: Qualitative results on AEA dataset. First row: user’s eye. Second row: eye tracking (left) and gaze-driven object detection (right). Predicted gaze (green), ground-truth gaze (red). Best viewed in Acrobat Reader; click images to play animations.
  • ...and 7 more figures