Toward Gaze Target Detection of Young Autistic Children

Shijian Deng, Erin E. Kosloski, Siva Sai Nagender Vasireddy, Jia Li, Randi Sierra Sherwood, Feroz Mohamed Hatha, Siddhi Patel, Pamela R Rollins, Yapeng Tian

TL;DR

The paper tackles automatic gaze target detection for young autistic children, a task complicated by a domain shift from neurotypical data and a strong class imbalance that underrepresents face-directed gaze. It introduces the Autism Gaze Target (AGT) dataset and the Socially Aware Coarse-to-Fine (SACF) framework, which uses a Multimodal Large Language Model as a social-context router to dynamically gate two specialized gaze experts. The two-pathway design—one expert focused on social (face-directed) gaze and another on non-social gaze—mitigates class imbalance and improves performance on the clinically critical Face class, achieving state-of-the-art results on several metrics, including a notable reduction in face-target localization errors. The work provides a foundation for AI-assisted, scalable assessment of joint attention in autism, with potential impact on clinical tools and intervention planning.

Abstract

The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child's point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children's tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: For gaze target detection in young autistic children: (a) Existing off-the-shelf models trained on neurotypical data cannot handle autism scenarios well because autistic individuals can have a different behavior distribution in the same social environment compared to neurotypical individuals, such as avoiding eye contact or not responding to verbal interaction. (b) Models directly fine-tuned on an autism dataset still struggle with the rare case of an autistic child looking at a person's face. (c) Our Socially Aware Coarse-to-Fine framework uses an MLLM as a router to dynamically utilize a Socially Aware Gaze Expert model and a Socially Agnostic Gaze Expert to address these issues.
  • Figure 2: Example from our Autism Gaze Target (AGT) dataset. The child's head is bounded by a blue box. The clinician's and parent's faces are marked with green and red boxes, respectively. The ground-truth target (a toy) is highlighted with a dashed orange box.
  • Figure 3: Annotation agreement. Most gaze targets are objects. Agreement remains high and drops slowly as the IoU Threshold increases, indicating strong consistency between annotators.
  • Figure 4: The architecture of our proposed Socially Aware Coarse-to-Fine (SACF) gaze detection framework. An input image is first processed by the Social Context Awareness (SCA) module to generate a social context score $s$. This score is used by a gate to dynamically assign the input image to two specialized expert pathways: a Socially Aware Gaze Expert ($Ex_{aware}$, Blur Focus) and a Socially Agnostic Gaze Expert ($Ex_{agnostic}$, General). The final prediction is derived from the routed expert outputs, allowing the framework to adapt its specialization based on the scene's social context.
  • Figure 5: A child is playing with an object. However, the baseline Sharingan model trained on the neurotypical Childplay dataset (left) is biased and predicts the child is looking at the clinician's face. When trained on our AGT dataset (right), it correctly predicts the child is looking at the object. The green dot denotes the ground-truth gaze target, while the red dot denotes the predicted target.
  • ...and 1 more figure
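
The routing described in the Figure 4 caption (a social context score $s$ feeding a gate that picks between two experts) can be illustrated with a minimal sketch. This is not the authors' implementation. The hard threshold, the assumption that $s$ lies in [0, 1], and the names `route_gaze`, `RoutedPrediction`, and `threshold` are all hypothetical choices made for illustration. The score function and both experts are passed in as plain callables, so any real model could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Normalized (x, y) gaze target location; the [0, 1] range is an assumption.
Point = Tuple[float, float]


@dataclass
class RoutedPrediction:
    point: Point   # predicted gaze target
    expert: str    # "aware" or "agnostic": which pathway produced it
    score: float   # social context score s that drove the routing


def route_gaze(
    image: object,
    social_context_score: Callable[[object], float],  # stands in for the SCA module
    ex_aware: Callable[[object], Point],              # Socially Aware Gaze Expert
    ex_agnostic: Callable[[object], Point],           # Socially Agnostic Gaze Expert
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> RoutedPrediction:
    """Hard-gate one image to one of two experts based on its score s."""
    s = social_context_score(image)
    if not 0.0 <= s <= 1.0:
        raise ValueError(f"expected a score in [0, 1], got {s}")
    if s >= threshold:
        # High social context: use the expert specialized in face-directed gaze.
        return RoutedPrediction(ex_aware(image), "aware", s)
    # Low social context: fall back to the general-purpose expert.
    return RoutedPrediction(ex_agnostic(image), "agnostic", s)


if __name__ == "__main__":
    # Stub models for demonstration only; real experts would be neural networks.
    demo = route_gaze(
        image="frame_0001",
        social_context_score=lambda img: 0.8,
        ex_aware=lambda img: (0.62, 0.21),
        ex_agnostic=lambda img: (0.40, 0.75),
    )
    print(demo)
```

A hard gate keeps the example short. A soft variant that blends the two expert outputs weighted by $s$ would be an equally plausible reading of the caption, which only states that the final prediction is derived from the routed expert outputs.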