Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions

Tongfei Bian, Yiming Ma, Mathieu Chollet, Victor Sanchez, Tanaya Guha

TL;DR

The paper defines a novel task of jointly forecasting a user’s intent to interact, attitude toward the agent, and future action from an egocentric viewpoint, and introduces SocialEgoNet, a graph-based multitask framework that operates on whole-body pose graphs derived from 1 second of video. The architecture combines three body-part-specific GCNs, multi-head self-attention, and a Bi-LSTM to produce a spatiotemporal embedding, followed by a hierarchical multitask classifier with a chain design intended to mimic human perception. To enable this study, the authors augment the JPL dataset into JPL-Social with per-person labels for intent, attitude, and actions. They demonstrate real-time inference and a higher average accuracy (83.15%) than the baselines, with a smaller model and lower latency. The work advances proactive human-agent interaction by delivering accurate, low-latency social cues, with potential for deployment in real-world robots and virtual agents; future work includes multimodal cues and in-the-wild evaluation.

Abstract

For efficient human-agent interaction, an agent should proactively recognize its target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent and the action they will perform, from the agent's (egocentric) perspective. To this end, we propose SocialEgoNet, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate real-time inference and superior performance (average accuracy across all tasks: 83.15%), with our model outperforming several competitive baselines. The additional annotations and code will be available upon acceptance.

Paper Structure

This paper contains 13 sections, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the proposed model, SocialEgoNet, for the joint forecasting task we introduce. We first use AlphaPose to generate whole-body keypoints from the input video clip. The keypoints of the face, body, and hands are then fed into separate Graph Convolutional Networks (GCNs) to extract spatial representations, which are fused by concatenation and passed through a multi-head self-attention module. The fused spatial representations from each frame are concatenated and the resulting sequence is fed into a bidirectional LSTM to model temporal relationships. Finally, the spatiotemporal feature output from the Bi-LSTM is passed to a hierarchical classifier to generate results for the three tasks.
  • Figure 2: JPL-Social class distribution for the three tasks—intent (in bold), attitude (in italic), and actions (in regular). The number of samples in each class is indicated in parentheses, and newly introduced labels are underscored.
  • Figure 3: The proposed designs of our hierarchical classifiers.
  • Figure 4: Effect of the observation window size on SocialEgoNet's performance.
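
The keypoint routing described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a 133-keypoint COCO-WholeBody-style index layout (body and feet first, then face, then hands), which may differ from the layout used in the paper, and the names `BODY_PARTS`, `split_keypoints`, and `split_clip` are hypothetical. It only shows how one frame's whole-body keypoints could be partitioned into the body, face, and hand groups that feed the three separate GCNs.

```python
from typing import Dict, List, Tuple

# One keypoint as (x, y, confidence); this tuple format is an assumption.
Keypoint = Tuple[float, float, float]

# ASSUMPTION: COCO-WholeBody-style ordering with 133 keypoints per person.
# The paper's actual index layout is not given in this summary.
NUM_KEYPOINTS = 133
BODY_PARTS: Dict[str, range] = {
    "body": range(0, 23),     # 17 body + 6 foot keypoints
    "face": range(23, 91),    # 68 face keypoints
    "hands": range(91, 133),  # 21 left-hand + 21 right-hand keypoints
}


def split_keypoints(frame: List[Keypoint]) -> Dict[str, List[Keypoint]]:
    """Partition one frame's whole-body keypoints into per-part groups."""
    if len(frame) != NUM_KEYPOINTS:
        raise ValueError(
            f"expected {NUM_KEYPOINTS} keypoints, got {len(frame)}"
        )
    return {
        name: [frame[i] for i in indices]
        for name, indices in BODY_PARTS.items()
    }


def split_clip(clip: List[List[Keypoint]]) -> List[Dict[str, List[Keypoint]]]:
    """Apply the per-frame split to every frame of an observation window."""
    return [split_keypoints(frame) for frame in clip]
```

In this sketch, each element returned by `split_clip` holds the three keypoint groups for one frame. A full model would pass each group through its own GCN, fuse the results, and feed the per-frame sequence to the Bi-LSTM; those stages are not shown here.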