Toward Gaze Target Detection of Young Autistic Children
Shijian Deng, Erin E. Kosloski, Siva Sai Nagender Vasireddy, Jia Li, Randi Sierra Sherwood, Feroz Mohamed Hatha, Siddhi Patel, Pamela R Rollins, Yapeng Tian
TL;DR
The paper tackles automatic gaze target detection for young autistic children, a task complicated by a domain shift from neurotypical data and a strong class imbalance that underrepresents face-directed gaze. It introduces the Autism Gaze Target (AGT) dataset and the Socially Aware Coarse-to-Fine (SACF) framework, which uses a Multimodal Large Language Model as a social-context router to dynamically gate two specialized gaze experts. The two-pathway design, with one expert focused on social (face-directed) gaze and another on non-social gaze, mitigates class imbalance and improves performance on the clinically critical Face class, achieving state-of-the-art results on several metrics, including a notable reduction in face-target localization errors. The work provides a foundation for AI-assisted, scalable assessment of joint attention in autism, with potential impact on clinical tools and intervention planning.
Abstract
Automatic, AI-based detection of gaze targets in autistic children could improve quality of life, especially for those who lack access to a sufficient number of professionals. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child's point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children's tendency to show reduced gaze to faces. The framework uses a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. Our comprehensive experiments demonstrate that the framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.
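
The gated two-pathway idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar router score `p_social`, the 0.5 threshold, and the soft-blend rule are all assumptions made for illustration, and the heatmaps are toy values. It only shows how a social-context score might select or blend the outputs of a social expert and a non-social expert.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # toy 2D gaze heatmap, rows x columns


def gate_experts(p_social: float,
                 social_map: Heatmap,
                 nonsocial_map: Heatmap,
                 hard: bool = False) -> Heatmap:
    """Combine two expert heatmaps using a router score (hypothetical rule).

    p_social is an assumed router output in [0, 1]: the estimated chance
    that the child's gaze is face-directed.
    """
    if not 0.0 <= p_social <= 1.0:
        raise ValueError("p_social must lie in [0, 1]")
    if hard:
        # Hard gate: pick exactly one expert (threshold 0.5 is an assumption).
        return social_map if p_social >= 0.5 else nonsocial_map
    # Soft gate: convex combination of the two expert heatmaps.
    return [
        [p_social * s + (1.0 - p_social) * n for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(social_map, nonsocial_map)
    ]


def argmax_point(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (row, col) of the largest heatmap value."""
    best, best_val = (0, 0), heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_val:
                best, best_val = (r, c), value
    return best


# Toy expert outputs: the social expert peaks top-left (a face),
# the non-social expert peaks bottom-right (an object).
social = [[0.9, 0.1], [0.0, 0.0]]
nonsocial = [[0.0, 0.0], [0.1, 0.9]]

face_target = argmax_point(gate_experts(0.8, social, nonsocial))
object_target = argmax_point(gate_experts(0.2, social, nonsocial))
```

With a high router score the combined prediction follows the social expert, and with a low score it follows the non-social expert. Training each expert on its own subset is one way such a design could keep the minority Face class from being swamped by non-social examples.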
