Foveated Instance Segmentation

Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang

TL;DR

The paper tackles instance segmentation for AR/VR by using the user's gaze to drive foveated processing: instead of segmenting the full frame, it segments only the instance of interest (IOI). It introduces FSNet, a gaze-aware, plug-and-play network with a two-branch design that outputs an IOI mask and a class label, coupled with a saliency-guided, deformable downsampling mechanism and a balanced loss that handles small IOI objects. Building on FSNet, the authors propose FovealSeg, a framework that leverages temporal gaze patterns to reuse segmentation results across frames, with a saccade-aware control flow to avoid redundant computation. The approach improves IoU on ADE20K and LVIS (e.g., $IoU=0.56$ and $IoU'=0.66$ in some configurations) and reduces computation (reported as up to $1.96\times$ in FLOPs and up to $75\times$ relative to full-frame baselines), enabling low-latency, real-time AR/VR performance.
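
The reuse idea can be illustrated with a minimal sketch. This is not the authors' control flow, which is not specified in this summary. It assumes that a large gaze jump marks a saccade, that segmentation is skipped during a saccade, and that a cached result is reused while the gaze stays put. The class name `GazeGatedSegmenter`, the `segment_fn` callback, and the `saccade_threshold` value are all hypothetical.

```python
import math


class GazeGatedSegmenter:
    """Hypothetical gaze-gated wrapper around an expensive segmentation call."""

    def __init__(self, segment_fn, saccade_threshold=0.1):
        self.segment_fn = segment_fn                # assumed callable: (frame, gaze) -> result
        self.saccade_threshold = saccade_threshold  # assumed, in normalized image coordinates
        self.prev_gaze = None
        self.cached = None

    def step(self, frame, gaze):
        # Distance the gaze moved since the previous frame (None on the first frame).
        moved = None if self.prev_gaze is None else math.dist(gaze, self.prev_gaze)
        self.prev_gaze = gaze
        if moved is not None and moved > self.saccade_threshold:
            # Treated as a saccade: drop the cached result and skip segmentation.
            self.cached = None
            return None
        if self.cached is None:
            # Gaze has settled on a new location: run the segmenter once.
            self.cached = self.segment_fn(frame, gaze)
        # Gaze is stable: reuse the cached result.
        return self.cached


calls = []


def fake_segment(frame, gaze):
    # Stand-in for a real network; records each invocation.
    calls.append(gaze)
    return ("mask", "label")


seg = GazeGatedSegmenter(fake_segment)
gazes = [(0.50, 0.50), (0.51, 0.50), (0.90, 0.20), (0.90, 0.21)]
results = [seg.step(frame=None, gaze=g) for g in gazes]
print(len(calls))  # 2: one call per fixation, none during the jump
```

This sketch ignores scene changes; a fuller version would also invalidate the cache when the frame content changes noticeably.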

Abstract

Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing the instance of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on the instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-
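
For readers unfamiliar with the metric, the following sketch shows the standard intersection-over-union computation for two binary masks. It illustrates the general definition only; how the paper aggregates IoU over a dataset is not stated here, and the function name `mask_iou` and the empty-mask convention are assumptions.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 overlapping pixels / 4 pixels in the union = 0.5
```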

Paper Structure

This paper contains 24 sections, 7 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) An example of gaze location for an AR user. (b) Trace of eye gaze within a segment.
  • Figure 2: An example of fixation and saccade of the human eye.
  • Figure 3: Processing latency of segmentation task on edge GPUs.
  • Figure 4: (a) Images and corresponding gaze locations from the Aria Everyday Activities dataset (Lv et al., 2024), collected from real users wearing a VR headset. (b) Left: Changes in gaze location across frames. Right: Histogram of gaze differences, with the yellow line marking the $95\%$ threshold. (c) Left: Normalized pixelwise differences across frames, with gray blocks indicating frames within the same segment. A threshold of $0.037$ is used to group similar frames. Right: Histogram of image differences. $35\%$ of pairs of consecutive frames have a difference of less than $0.037$.
  • Figure 5: An overview of the FovealSeg framework.
  • ...and 2 more figures
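
The frame grouping described in the Figure 4(c) caption can be sketched as follows. This is an illustration under assumptions, not the authors' code: the "normalized pixelwise difference" is taken here to be the mean absolute difference between pixel values scaled to $[0, 1]$, and each frame is compared with the one immediately before it. Only the $0.037$ threshold comes from the caption; the function names `frame_difference` and `group_frames` are hypothetical.

```python
def frame_difference(a, b):
    """Mean absolute pixel difference; frames are 2D lists with values in [0, 1]."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def group_frames(frames, threshold=0.037):
    """Split frame indices into segments of consecutive, similar frames."""
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) < threshold:
            segments[-1].append(i)   # close to the previous frame: same segment
        else:
            segments.append([i])     # large change: start a new segment
    return segments


frames = [
    [[0.10, 0.10], [0.10, 0.10]],
    [[0.11, 0.10], [0.10, 0.12]],  # small change from frame 0
    [[0.90, 0.90], [0.90, 0.90]],  # large change from frame 1
    [[0.90, 0.91], [0.90, 0.90]],  # small change from frame 2
]
print(group_frames(frames))  # [[0, 1], [2, 3]]
```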