Inferring Past Human Actions in Homes with Abductive Reasoning
Clement Tan, Chai Kiat Yeo, Cheston Tan, Basura Fernando
TL;DR
This work defines abductive past action inference: inferring plausible past human actions from a single image by leveraging current scene evidence. It introduces an object-relational representation framework and a set of architectures (GNNED, RBP, and BiGED) to reason over human–object relations, with BiGED achieving the strongest performance by fusing bilinear pooling and a relational graph encoder–decoder. Evaluations on the Action Genome/Charades-derived dataset show that object-relational approaches outperform end-to-end and vision–language baselines, though humans still outperform AI, highlighting the challenge and value of relational reasoning for abductive inference. The findings suggest practical impact for human–robot interaction and elder care, where understanding past actions from present evidence can improve safety and decision-making; code and data are released to facilitate further research.
Abstract
Abductive reasoning aims to make the most likely inference for a given set of incomplete observations. In this paper, we introduce "Abductive Past Action Inference", a novel research task aimed at identifying the past actions performed by individuals within homes to reach specific states captured in a single image, using abductive inference. The research explores three key abductive inference problems: past action set prediction, past action sequence prediction, and abductive past action verification. We introduce several models tailored for abductive past action inference, including a relational graph neural network, a relational bilinear pooling model, and a relational transformer model. Notably, the newly proposed object-relational bilinear graph encoder-decoder (BiGED) model emerges as the most effective among all methods evaluated, performing well on the challenging Action Genome dataset. The contributions of this research significantly advance the ability of deep learning models to reason about current scene evidence and make highly plausible inferences about past human actions. This advancement enables a deeper understanding of events and behaviors, which can enhance decision-making and improve system capabilities across various real-world applications, such as Human-Robot Interaction as well as Elderly Care and Health Monitoring. Code and data are available at https://github.com/LUNAProject22/AAR
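To make the three problem formulations concrete, the following is a minimal sketch of how their outputs differ, assuming a model has already produced a plausibility score per candidate action for one image. It is not the released code: the function names, the action labels, the scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical per-action plausibility scores for a single image.
# The labels and values are invented for illustration only.
toy_scores: Dict[str, float] = {
    "take cup": 0.9,
    "open fridge": 0.7,
    "sit on sofa": 0.2,
}


def predict_action_set(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Set prediction: an unordered set of actions judged plausible."""
    return {action for action, score in scores.items() if score >= threshold}


def verify_action(scores: Dict[str, float], action: str, threshold: float = 0.5) -> bool:
    """Verification: a yes/no decision for one queried action."""
    return scores.get(action, 0.0) >= threshold


def sequence_to_set(sequence: List[str]) -> Set[str]:
    """Sequence prediction outputs an ordered list; dropping the order
    (and any repeats) recovers the corresponding action set."""
    return set(sequence)


predicted_set = predict_action_set(toy_scores)
is_plausible = verify_action(toy_scores, "sit on sofa")
# A hypothetical predicted sequence, in the order the actions occurred.
predicted_sequence = ["open fridge", "take cup"]
```

In this sketch, set prediction and verification share the same scores and differ only in what is asked of them, while sequence prediction additionally commits to a temporal order. How the scores are obtained (for example, by the relational models named above) is outside the scope of the sketch.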
