
Towards Unsupervised Eye-Region Segmentation for Eye Tracking

Jiangfan Deng, Zhuang Jia, Zhaoxue Wang, Xiang Long, Daniel K. Du

TL;DR

The unsupervised approach achieves 90% (pupil and iris) and 85% (whole eye-region) of the performance of supervised learning. Training is end-to-end and follows a progressive, prior-aware principle.

Abstract

Finding the eye and parsing out its parts (e.g. pupil and iris) is a key prerequisite for image-based eye tracking, which has become an indispensable module in today's head-mounted VR/AR devices. However, the typical route for training a segmenter requires tedious hand-labeling. In this work, we explore an unsupervised approach. First, we use priors of the human eye and extract signals from the image to establish rough clues indicating the eye-region structure. From these sparse and noisy clues, a segmentation network is trained to gradually identify the precise area of each part. To achieve accurate parsing of the eye-region, we first leverage the pretrained foundation model Segment Anything (SAM) in an automatic way to refine the eye indications. Then, the learning process is designed in an end-to-end manner following a progressive and prior-aware principle. Experiments show that our unsupervised approach can achieve 90% (the pupil and iris) and 85% (the whole eye-region) of the performance of supervised learning.

Paper Structure (14 sections, 11 equations, 8 figures, 6 tables)


Figures (8)

  • Figure 1: Demonstration of our task. The upper image is an example of a photo captured by the near-infrared camera pointed at the eye. Each part of the eye (the pupil, iris and sclera) is marked with an individual color. Our goal is parsing the eye-region: finding the area of the pupil, the iris and the whole eye in the image, or more specifically, performing a semantic segmentation of the eye and its parts (as shown in the lower image). We seek to accomplish this task in an unsupervised way.
  • Figure 2: Overview of our method. The segmentation network has two heads: one for pupil/iris and the other for the eye-region. Training of pupil/iris segmentation is activated first, using indications generated by the Pupil-Iris Indicator (PII). Then the eye segmentation head is unlocked for updating, using indications from EI, which relies on the pupil/iris prediction. After that, both tasks run simultaneously. During training, a progressive learning module is used for each task to resist noise and refine the outputs.
  • Figure 3: Pupil-Iris Indicator. First, gradients on the input image are computed using the Sobel operator (a,b). Then, Eq.(\ref{eq:cos}) and Eq.(\ref{eq:gdf}) are adopted to filter gradients within each window (b,c,d). After that, an indicating process based on a set of rays originating from $p_o$ is applied (e,f,g), generating sparse indications for the pupil and iris (h).
  • Figure 4: Initial indication of the eye. (a) Compute GSTD on the gradient map. (b) Smooth areas on the image (marked in light blue). (c) Indicate the eye. (d) Output indications: green: eye (foreground), red: background, others: ignored.
  • Figure 5: Indication Refinement with SAM. The initial indications $\Phi$ are used to generate point-prompts $\mathcal{P}$ to interact with SAM (a). Based on the SAM output $\mathcal{O}$, the set $\mathcal{B}$ of points on the rough contour of the eye is acquired (b,c). Then, contour sections that are not smooth are filtered out by computing second-order derivatives (c,d). The final "reliable but incomplete" eye indications $\hat{\Phi}$ are obtained (d).
  • ...and 3 more figures
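
The first step of the Pupil-Iris Indicator in Figure 3 (a,b), computing image gradients with the Sobel operator, can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the helper names `filter3x3` and `sobel_gradients` and the edge-replicating padding are assumptions, and the window-level filters of Eq.(\ref{eq:cos}) and Eq.(\ref{eq:gdf}) are not reproduced because their definitions are not given here.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) derivatives.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel.

    Border pixels are handled by replicating the edge (an assumption).
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    # Accumulate the nine shifted, weighted copies of the image.
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_gradients(image):
    """Return (gx, gy, magnitude) of the Sobel gradient of a 2-D image."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return gx, gy, np.hypot(gx, gy)


# Toy input: a vertical step edge, dark on the left and bright on the right.
toy = np.zeros((5, 6))
toy[:, 3:] = 1.0
gx, gy, magnitude = sobel_gradients(toy)
```

On the toy image the horizontal gradient is non-zero only in the two columns adjacent to the step, and the vertical gradient is zero everywhere. Real near-infrared eye images would give dense, noisy gradients, which is why the caption describes further filtering before any indication is produced.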
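
The contour-filtering step in Figure 5 (c,d) removes sections of the rough eye contour that are not smooth, using second-order derivatives. One way this might look, under stated assumptions, is the sketch below: the contour is taken to be an ordered, closed sequence of points, the second derivative is approximated by a central second-order difference, and the function name `smooth_contour_mask` and the fixed `threshold` are hypothetical choices, not details from the paper.

```python
import numpy as np


def smooth_contour_mask(points, threshold):
    """Mark contour points whose local second-order difference is small.

    points:    (N, 2) array of ordered points on a closed contour (assumed).
    threshold: hypothetical cutoff on the second-difference magnitude.
    Returns a boolean array of length N; True means "locally smooth".
    """
    pts = np.asarray(points, dtype=float)
    # Central second difference p[i-1] - 2*p[i] + p[i+1], wrapping at the ends.
    second = np.roll(pts, 1, axis=0) - 2.0 * pts + np.roll(pts, -1, axis=0)
    return np.linalg.norm(second, axis=1) <= threshold


# Toy contour: 16 points on a circle of radius 10, with one point pushed outward.
angles = 2.0 * np.pi * np.arange(16) / 16
contour = 10.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
contour[5] *= 2.0  # artificial spike standing in for a non-smooth section

mask = smooth_contour_mask(contour, threshold=3.0)
reliable = contour[mask]  # points kept as "reliable but incomplete"
```

In this toy case the spike and its two neighbours are rejected while the rest of the circle is kept, which mirrors the "reliable but incomplete" character of the refined indications $\hat{\Phi}$ described in the caption.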