Eyes on the Game: Deciphering Implicit Human Signals to Infer Human Proficiency, Trust, and Intent

Nikhil Hulle, Stéphane Aroca-Ouellette, Anthony J. Ries, Jake Brawer, Katharina von der Wense, Alessandro Roncone

TL;DR

This work tackles how to infer human proficiency, trust, and intent in human-AI collaboration from implicit cues. It collects a large, public dataset of paired eye gaze and gameplay data in Overcooked and trains a causal transformer to predict per-timestep trust, per-round proficiency, and upcoming subtasks, comparing eye gaze, gameplay, and their combination. Key findings show that eye gaze yields strong early signals, gameplay strengthens with task progression, and integrating both modalities provides the best predictive performance. The choice of gaze representation matters: the aggregation methods used in prior work can substantially reduce the predictive ability of eye gaze. The dataset and analysis offer practical guidance for building adaptive agents capable of rapidly aligning with new teammates in fast-paced settings, and the authors release resources to support replication and further research.

Abstract

Effective collaboration between humans and AIs hinges on transparent communication and alignment of mental models. However, explicit, verbal communication is not always feasible. Under such circumstances, human-human teams often depend on implicit, nonverbal cues to glean important information about their teammates such as intent and expertise, thereby bolstering team alignment and adaptability. Among these implicit cues, two of the most salient and fundamental are a human's actions in the environment and their visual attention. In this paper, we present a novel method to combine eye gaze data and behavioral data, and evaluate their respective predictive power for human proficiency, trust, and intent. We first collect a dataset of paired eye gaze and gameplay data in the fast-paced collaborative "Overcooked" environment. We then train models on this dataset to compare how the predictive powers differ between gaze data, gameplay data, and their combination. We additionally compare our method to prior works that aggregate eye gaze data and demonstrate how these aggregation methods can substantially reduce the predictive ability of eye gaze. Our results indicate that, while eye gaze data and gameplay data excel in different situations, a model that integrates both types consistently outperforms all baselines. This work paves the way for developing intuitive and responsive agents that can efficiently adapt to new teammates.

Paper Structure (16 sections, 5 figures)

This paper contains 16 sections, 5 figures.

Figures (5)

  • Figure 1: In this work, we collect a large dataset of paired eye gaze and gameplay data in the collaborative game "Overcooked." Using this data, we train a causal transformer demonstrating state-of-the-art performance in its ability to predict a collaborator's task proficiency, trust in an autonomous teammate, and future intent.
  • Figure 2: The three "Overcooked" layouts used. From oai.
  • Figure 3: An overview of the processing method to create representations of eye gaze data, gameplay data, and enable a combination of the two for a single timestep. The representations are designed to be easily fed into modern neural networks.
  • Figure 4: F1 scores over time for different implicit human signals predicting human proficiency, trust, and future intents, starting at timestep $\mathbf{0}$ of each trial. The top row shows the per-timestep predictions output by our transformer model, which can handle time-series data. The bottom row shows the cumulative prediction over all past timesteps. Dotted lines represent methods that aggregate over time and use the full 20-second window for their prediction.
  • Figure 5: F1 scores starting at timestep $\mathbf{200}$. Refer to Figure 4 for a full description of the figure.
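
The Figure 4 caption distinguishes per-timestep predictions from a cumulative prediction over all past timesteps. The page does not say how the cumulative prediction is computed, so the following is only a minimal sketch under one plausible assumption: the class probabilities emitted at each timestep are averaged over all timesteps seen so far, and the highest-scoring class is reported. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
def per_step_predictions(step_probs):
    """Pick the most probable class independently at every timestep."""
    return [max(range(len(p)), key=lambda c: p[c]) for p in step_probs]


def cumulative_predictions(step_probs):
    """For each timestep t, pick the class with the highest mean
    probability over timesteps 0..t.

    ASSUMPTION: running-mean aggregation is an illustrative choice,
    not necessarily the aggregation used by the authors.
    """
    if not step_probs:
        return []
    n_classes = len(step_probs[0])
    running = [0.0] * n_classes  # running sum of probabilities per class
    predictions = []
    for probs in step_probs:
        if len(probs) != n_classes:
            raise ValueError("every timestep needs one probability per class")
        running = [r + p for r, p in zip(running, probs)]
        # The argmax of the running sum equals the argmax of the running mean.
        predictions.append(max(range(n_classes), key=lambda c: running[c]))
    return predictions


if __name__ == "__main__":
    # Toy two-class example (e.g. "low trust" = 0, "high trust" = 1).
    toy = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.2, 0.8]]
    print(per_step_predictions(toy))
    print(cumulative_predictions(toy))
```

Under this assumption the cumulative prediction changes more slowly than the per-timestep one, because a confident early output keeps influencing the running mean until later evidence outweighs it.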