I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data

Hoang H. Le, Duy M. H. Nguyen, Omair Shahzad Bhatti, Laszlo Kopacsi, Thinh P. Ngo, Binh T. Nguyen, Michael Barz, Daniel Sonntag

TL;DR

This work tackles the bottleneck of manual annotation in mobile eye-tracking data by introducing I-MPN, a hybrid architecture that combines an object detector with an inductive graph neural network to reason about spatial relationships among objects. The system uses video object segmentation (VOS) for efficient, interactive, human-in-the-loop annotation and updates the detector and relational model end-to-end as new feedback arrives, so it can adapt quickly to dynamic scenes with unseen objects. Across three video sequences, I-MPN outperforms fixed-train baselines and other interactive approaches while substantially reducing annotation time (e.g., 60–70% faster label generation) and improving spatial discrimination between closely related objects (e.g., left vs. right devices). Overall, the approach advances robust, data-efficient human-in-the-loop learning for real-time mobile eye-tracking applications, with potential implications for ADAS, human-computer interaction, and on-device processing.

Abstract

Comprehending how humans process visual information in dynamic settings is crucial for psychology and designing user-centered interactions. While mobile eye-tracking systems combining egocentric video and gaze signals can offer valuable insights, manual analysis of these recordings is time-intensive. In this work, we present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations. Such mechanisms enable us to learn embedding functions capable of generalizing to new object angle views, facilitating rapid adaptation and efficient reasoning in dynamic contexts as users navigate their environment. Through experiments conducted on three distinct video sequences, our interactive-based method showcases significant performance improvements over fixed training/testing algorithms, even when trained on considerably smaller annotated samples collected through user feedback. Furthermore, we demonstrate exceptional efficiency in data annotation processes and surpass prior interactive methods that use complete object detectors, combine detectors with convolutional networks, or employ interactive video segmentation.

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 2 tables, 3 algorithms.

Figures (6)

  • Figure 1: Our mobile eye-tracking setup with different viewpoints.
  • Figure 2: Overview of our human-in-the-loop I-MPN approach. The bottom dashed arrow indicates the feedback loop. The human interacts with the video object segmentation algorithm to generate annotations used to train an object detector and another graph reasoning network.
  • Figure 3: The video object segmentation-based interface allows users to annotate frames using weak prompts like clicks and scribbles, then propagate these annotations to subsequent frames.
  • Figure 4: Comparative performance analysis.
  • Figure 5: (a) Eye-tracking point classification results improve after the model is updated with user feedback; different objects are evaluated given fixation points. (b) Comparison of human-in-the-loop methods on classes requiring spatial object understanding. Results are reported as balanced accuracy; higher is better.
  • ...and 1 more figures