Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions
Tongfei Bian, Yiming Ma, Mathieu Chollet, Victor Sanchez, Tanaya Guha
TL;DR
The paper defines a novel task: jointly forecasting, from an egocentric viewpoint, a user's intent to interact, their attitude toward the agent, and their future action. It introduces SocialEgoNet, a graph-based multitask framework that operates on whole-body pose graphs derived from 1 second of video. The architecture combines three body-part-specific GCNs, multi-head self-attention, and a Bi-LSTM to produce a spatiotemporal embedding, followed by a hierarchical multitask classifier whose chain design is intended to mimic human perception. To enable this study, the authors augment the JPL dataset with per-person labels for intent, attitude, and actions, yielding JPL-Social. On this dataset the model runs in real time and reaches a higher average accuracy (83.15%) than the baselines, with a smaller model size and lower latency. The work advances proactive human-agent interaction by forecasting social cues accurately and with low latency, with potential for deployment in real-world robots and virtual agents; future work includes multimodal cues and in-the-wild evaluation.
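As a rough illustration of the chain idea in a hierarchical multitask classifier, the sketch below feeds each head's predicted distribution into the next head. It is a minimal sketch under stated assumptions, not the authors' implementation: the ordering intent, then attitude, then action, the class counts, the embedding size, and the names `Linear` and `ChainedHeads` are all hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Linear:
    """Tiny dense layer with random weights (illustrative, untrained)."""

    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


class ChainedHeads:
    """Three classification heads chained so later tasks see earlier outputs.

    Assumption: the chain runs intent -> attitude -> action. The source only
    states that the classifier is hierarchical with a chain design.
    """

    def __init__(self, embed_dim, n_intent, n_attitude, n_action, seed=0):
        rng = random.Random(seed)
        self.intent = Linear(embed_dim, n_intent, rng)
        self.attitude = Linear(embed_dim + n_intent, n_attitude, rng)
        self.action = Linear(embed_dim + n_intent + n_attitude, n_action, rng)

    def __call__(self, embedding):
        # Each head receives the shared embedding plus all earlier predictions.
        p_intent = softmax(self.intent(embedding))
        p_attitude = softmax(self.attitude(embedding + p_intent))
        p_action = softmax(self.action(embedding + p_intent + p_attitude))
        return p_intent, p_attitude, p_action


# Hypothetical sizes: an 8-dimensional embedding and 3/3/5 classes per task.
heads = ChainedHeads(embed_dim=8, n_intent=3, n_attitude=3, n_action=5)
p_intent, p_attitude, p_action = heads([0.1] * 8)
```

In a real system the embedding would come from the spatiotemporal encoder described above and the heads would be trained jointly; the point here is only how conditioning later tasks on earlier predictions can be wired.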
Abstract
For efficient human-agent interaction, an agent should proactively recognize its target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent, and the action they will perform, from the agent's (egocentric) perspective. To this end, we propose \emph{SocialEgoNet}, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate that our model achieves \emph{real-time} inference and outperforms several competitive baselines (average accuracy across all tasks: 83.15\%). The additional annotations and code will be available upon acceptance.
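To make the input format concrete, the following sketch shows one way a 1-second window of whole-body keypoints could be cut from a longer sequence and split into body, face, and hand groups. It rests on assumptions not stated in the abstract: a 133-keypoint COCO-WholeBody index layout, a frame rate of 30 fps, and the hypothetical helper names `last_second` and `split_parts`. Grouping the foot keypoints with the body is likewise a choice made only for this example.

```python
# Assumed COCO-WholeBody index layout (133 keypoints per person).
# The abstract does not say which keypoint format is used.
PART_SLICES = {
    "body": slice(0, 23),     # 17 body + 6 foot keypoints
    "face": slice(23, 91),    # 68 face keypoints
    "hands": slice(91, 133),  # 21 left-hand + 21 right-hand keypoints
}


def last_second(frames, fps):
    """Return the most recent `fps` frames, i.e. a 1-second window."""
    if fps <= 0:
        raise ValueError("fps must be a positive integer")
    return frames[-fps:]


def split_parts(frame):
    """Split one frame's list of (x, y) keypoints into per-part lists."""
    return {name: frame[part] for name, part in PART_SLICES.items()}


# Dummy sequence: 40 frames, each holding 133 (x, y) keypoints.
frames = [[(float(t), 0.0)] * 133 for t in range(40)]
window = last_second(frames, fps=30)  # hypothetical frame rate
parts = split_parts(window[0])
```

Each per-part sequence could then be passed to its own graph encoder, in line with the body-part grouping the abstract describes.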
