GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion

Zhuojiang Cai; Zhenghui Sun; Feng Lu

Abstract

We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.
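
The global-local fusion described above can be illustrated with a minimal single-head cross-attention sketch. The Figure 4 caption states that a cross-attention module fuses global and local representations; everything else here is an assumption made for illustration, not the authors' implementation: the function name `cross_attention`, the choice of local features as queries and global features as keys and values, the residual connection, and all tensor shapes are hypothetical.

```python
import numpy as np

def cross_attention(local_feats, global_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch only).

    Assumption: local (high-resolution) tokens act as queries and
    global (low-resolution) tokens act as keys and values.
    """
    q = local_feats @ w_q                     # (n_local, d)
    k = global_feats @ w_k                    # (n_global, d)
    v = global_feats @ w_v                    # (n_global, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product similarity
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over global tokens
    # Assumed residual connection: requires d to equal the local feature width.
    fused = local_feats + weights @ v
    return fused, weights

# Hypothetical sizes: 4 local tokens, 49 global tokens, 16 channels.
rng = np.random.default_rng(0)
local_feats = rng.normal(size=(4, 16))
global_feats = rng.normal(size=(49, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
fused, weights = cross_attention(local_feats, global_feats, w_q, w_k, w_v)
```

In this sketch each local token receives a weighted summary of the global context, so the fused output keeps the local token count while incorporating scene-level information.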

Paper Structure (42 sections, 7 equations, 8 figures, 10 tables)

Introduction
Related Work
Gaze Estimation Datasets
Gaze Estimation Methods
Fisheye Imaging and Perception
Problem Formulation
Synthetic Dataset: MPSGaze360
Generation Pipeline
Ground-Truth Annotations
Dataset Statistics
GazeOnce360
Preliminaries: Rotational Convolution
Dual-Resolution Feature Fusion
Feature Extraction.
Cross-Attention Fusion.
...and 27 more sections

Figures (8)

Figure 1: Comparison between the existing method [caiGam360SensingGaze2025] and the proposed GazeOnce360. The existing method (top) is a multi-step pipeline, whereas GazeOnce360 (bottom) is a single end-to-end model that is more efficient and scalable, with improved robustness and simplicity.
Figure 2: Overview of the MPSGaze360 data generation pipeline. We first load a virtual indoor environment and populate it with multiple MetaHuman characters of diverse appearances. For each subject, we randomly sample head orientation $(\alpha_i^h, \beta_i^h, \gamma_i^h)$, gaze direction $(\alpha_i^g, \beta_i^g)$, and eyelid closure $c_i$. The process also includes extraction of 2D landmarks and gaze vectors, annotation aggregation, and compliance validation. Each sample is rendered as five orthogonal perspective views and subsequently projected into a single 180° equidistant fisheye image.
Figure 3: Visualization of the proposed MPSGaze360 dataset. These examples demonstrate the realism, diversity, and annotation accuracy of the synthetic dataset.
Figure 4: Overall architecture of GazeOnce360. GazeOnce360 is an anchor-based CNN with rotational convolutions. The low-resolution fisheye image is processed by the global branch to capture large-scale spatial context and to detect face bounding boxes. For each detected region, the local branch extracts high-resolution facial features to capture fine-grained gaze cues. A cross-attention module fuses global and local representations, followed by multi-task heads that predict landmarks and gaze directions. During training, an additional supervision signal is applied to the local branch to regress per-face gaze for improved ocular feature learning.
Figure 5: Qualitative comparison of gaze prediction with and without eye landmark supervision. Each column shows predicted gaze directions (yellow) and ground truth (red). Without eye landmark supervision (left), predictions deviate under extreme head poses, while adding landmark supervision (right) captures fine-grained ocular cues for more accurate gaze directions.
...and 3 more figures
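
The Figure 2 caption mentions projecting rendered views into a single 180° equidistant fisheye image. As a rough illustration of the equidistant model, in which the radial distance from the image center grows linearly with the angle from the optical axis ($r = f\theta$), the sketch below maps a 3D direction to pixel coordinates. The function name `equidistant_project`, the coordinate convention (optical axis along +z, image center at the middle of a square image), and the image size are assumptions for illustration, not details taken from the paper.

```python
import math

def equidistant_project(direction, image_size, fov_deg=180.0):
    """Project a 3D direction onto a square equidistant fisheye image.

    Assumptions: the optical axis is +z, the image circle touches the
    image border, and the principal point is the image center.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    # Angle between the direction and the optical axis.
    theta = math.acos(max(-1.0, min(1.0, z / norm)))
    # Azimuth around the optical axis.
    phi = math.atan2(y, x)
    radius = image_size / 2.0
    # Focal length chosen so that theta = fov/2 lands on the image circle.
    focal = radius / math.radians(fov_deg / 2.0)
    r = focal * theta  # equidistant model: r = f * theta
    center = image_size / 2.0
    return center + r * math.cos(phi), center + r * math.sin(phi)

# Hypothetical 1000x1000 image.
on_axis = equidistant_project((0.0, 0.0, 1.0), 1000)
horizon = equidistant_project((1.0, 0.0, 0.0), 1000)
halfway = equidistant_project((1.0, 0.0, 1.0), 1000)
```

Under these assumptions a direction along the optical axis lands at the image center, a direction 90° off-axis lands on the image circle, and a direction 45° off-axis lands halfway between them, which is the linear angle-to-radius behavior that distinguishes the equidistant model from a perspective projection.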
