Eyes on the Game: Deciphering Implicit Human Signals to Infer Human Proficiency, Trust, and Intent
Nikhil Hulle, Stéphane Aroca-Ouellette, Anthony J. Ries, Jake Brawer, Katharina von der Wense, Alessandro Roncone
TL;DR
This work addresses how to infer human proficiency, trust, and intent in human-AI collaboration from implicit cues. It collects a large, public dataset of paired eye gaze and gameplay data in Overcooked and trains a causal transformer to predict per-timestep trust, per-round proficiency, and upcoming subtasks, comparing eye gaze, gameplay, and their combination. The results show that eye gaze yields strong early signals, gameplay becomes more informative as the task progresses, and integrating both modalities provides the best predictive performance. How gaze data is represented matters: aggregating it can substantially reduce its predictive power. The dataset and analysis offer practical guidance for building adaptive agents that can rapidly align with new teammates in fast-paced settings, and the authors release resources to support replication and further research.
Abstract
Effective collaboration between humans and AIs hinges on transparent communication and alignment of mental models. However, explicit verbal communication is not always feasible. In such circumstances, human-human teams often rely on implicit, nonverbal cues to glean important information about their teammates, such as intent and expertise, thereby bolstering team alignment and adaptability. Among these implicit cues, two of the most salient and fundamental are a human's actions in the environment and their visual attention. In this paper, we present a novel method to combine eye gaze data and behavioral data, and evaluate their respective predictive power for human proficiency, trust, and intent. We first collect a dataset of paired eye gaze and gameplay data in the fast-paced collaborative "Overcooked" environment. We then train models on this dataset to compare the predictive power of gaze data, gameplay data, and their combination. We additionally compare our method to prior work that aggregates eye gaze data and demonstrate that these aggregation methods can substantially reduce the predictive ability of eye gaze. Our results indicate that, while eye gaze data and gameplay data excel in different situations, a model that integrates both types consistently outperforms all baselines. This work paves the way for developing intuitive and responsive agents that can efficiently adapt to new teammates.
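
To illustrate why aggregating gaze data can discard useful signal, the following minimal sketch compares two gaze sequences under a simple aggregate. It is not the method used in the paper or in the prior work it cites. It assumes, purely for illustration, that aggregation means collapsing a sequence into the fraction of samples spent on each area of interest (AOI). The AOI labels, the two sequences, and the helper names `aggregate_dwell` and `transitions` are all hypothetical.

```python
from collections import Counter

# Hypothetical AOI labels, one gaze sample per timestep (illustrative only).
seq_a = ["pot", "pot", "onion", "teammate", "plate", "plate"]
seq_b = ["plate", "teammate", "pot", "onion", "plate", "pot"]


def aggregate_dwell(gaze_seq):
    """Collapse a gaze sequence into the fraction of samples on each AOI.

    This is an assumed stand-in for an aggregation step; it ignores the
    order in which AOIs were visited.
    """
    counts = Counter(gaze_seq)
    total = len(gaze_seq)
    return {aoi: n / total for aoi, n in counts.items()}


def transitions(gaze_seq):
    """Count ordered AOI-to-AOI transitions, a simple order-sensitive feature."""
    return Counter(zip(gaze_seq, gaze_seq[1:]))


# The two sequences are indistinguishable after aggregation...
same_aggregate = aggregate_dwell(seq_a) == aggregate_dwell(seq_b)
# ...but differ once temporal order is taken into account.
same_transitions = transitions(seq_a) == transitions(seq_b)

print("identical dwell fractions:", same_aggregate)
print("identical transition counts:", same_transitions)
```

Under this assumed aggregate, the two sequences become identical even though the player looked at the AOIs in a different order. A sequence model that consumes per-timestep gaze can, in principle, still tell them apart.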
