Learning to Visually Connect Actions and their Effects
Paritosh Parmar, Eric Peh, Basura Fernando
TL;DR
This work proposes CATE, the task of visually connecting actions to their effects in videos, and formulates two aspects of it: Action Selection (AS) and Effect-Affinity Assessment (EAA). It introduces a large cross-sample dataset and a spectrum of baselines, showing that humans outperform the models by a large margin and that analogical reasoning yields the strongest AS performance among the baselines. The paper also shows that CATE can serve as a self-supervised pretext task, yielding transferable video representations that improve downstream tasks such as action quality assessment (AQA). Overall, the work presents CATE as a fundamental and versatile problem in action-effect understanding and invites future models to build on it for planning, learning from demonstration, and interactive AI systems.
Abstract
We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection (AS) and Effect-Affinity Assessment (EAA), where video understanding models connect actions and effects at semantic and fine-grained levels, respectively. We design various baseline models for AS and EAA. Despite the intuitive nature of the task, we observe that models struggle, and humans outperform them by a large margin. Our experiments show that in solving AS and EAA, models learn intuitive properties like object tracking and pose encoding without explicit supervision. We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos. The study aims to showcase the fundamental nature and versatility of CATE, with the hope of inspiring advanced formulations and models.
