Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
Yousef Azizi Movahed, Fatemeh Ziaeetabar
TL;DR
The paper tackles fine-grained classification of atomic hand-object interaction (HOI) states by converting raw MANIAC videos into a structured, statistical-kinematic feature dataset of 27,476 samples, then evaluating static and temporal models on it. Counterintuitively, running a Bidirectional RNN with seq_length=1, so that it acts as a static encoder, yields substantial gains, reaching 97.60% accuracy and a balanced F1-score of 0.90 on the challenging 'grabbing' state. The work establishes an interpretable benchmark for low-level HOI state recognition using engineered features and lightweight architectures, offering a practical baseline for future work with more complex models such as graph neural networks (GNNs). It also highlights the potential of reinterpreting recurrent cells as static encoders when the feature representation is rich enough.
Abstract
Reliably predicting human intent in hand-object interactions remains an open challenge in computer vision. We focus on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors, each encapsulating relational and dynamic properties from a short temporal window of motion. Our initial hypothesis was that sequential modeling would be critical, so we compared static classifiers (MLPs) against temporal models (RNNs). Counterintuitively, the key discovery came when we set the sequence length of a Bidirectional RNN to one (seq_length=1). With a single time step there is no temporal context to propagate, so the network acts as a high-capacity static feature encoder. This architectural change led to a significant accuracy improvement, with a final score of 97.60%. Notably, the optimized model also handled the most challenging transitional class, 'grabbing', achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
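To make the seq_length=1 observation concrete, the following minimal sketch checks numerically that a bidirectional RNN fed a single time step computes the same output as a plain feed-forward layer. It is an illustration under stated assumptions, not the authors' implementation: it assumes a vanilla tanh (Elman) cell and a zero initial hidden state, neither of which is specified in the abstract, and the helper names and the sizes N_FEATURES and N_HIDDEN are hypothetical. Only the Python standard library is used.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_step(W_in, W_rec, b, x, h_prev):
    """One tanh RNN step: h = tanh(W_in x + W_rec h_prev + b)."""
    a = matvec(W_in, x)
    r = matvec(W_rec, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]


def run_direction(params, seq, reverse=False):
    """Run one direction over seq; hidden states are returned in input order."""
    W_in, W_rec, b = params
    h = [0.0] * len(b)  # assumption: zero initial hidden state
    steps = range(len(seq))
    states = [None] * len(seq)
    for t in (reversed(steps) if reverse else steps):
        h = rnn_step(W_in, W_rec, b, seq[t], h)
        states[t] = h
    return states


def birnn(fwd, bwd, seq):
    """Bidirectional RNN: concatenate forward and backward states per step."""
    f = run_direction(fwd, seq)
    b = run_direction(bwd, seq, reverse=True)
    return [hf + hb for hf, hb in zip(f, b)]


def static_encoder(fwd, bwd, x):
    """Feed-forward equivalent for one time step; recurrent weights are unused."""
    out = []
    for W_in, _W_rec, b in (fwd, bwd):
        out += [math.tanh(a + bi) for a, bi in zip(matvec(W_in, x), b)]
    return out


def random_params(rng, n_in, n_hidden):
    """Random (W_in, W_rec, b) for one direction; values are illustrative only."""
    W_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W_rec = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_hidden)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return W_in, W_rec, b


rng = random.Random(0)
N_FEATURES, N_HIDDEN = 6, 4  # hypothetical sizes, not taken from the paper
fwd = random_params(rng, N_FEATURES, N_HIDDEN)
bwd = random_params(rng, N_FEATURES, N_HIDDEN)
x = [rng.uniform(-1, 1) for _ in range(N_FEATURES)]  # one feature vector

recurrent_out = birnn(fwd, bwd, [x])[0]  # sequence of length 1
static_out = static_encoder(fwd, bwd, x)
max_diff = max(abs(a - b) for a, b in zip(recurrent_out, static_out))
print("largest difference between the two outputs:", max_diff)
```

Under these assumptions the recurrent matrices multiply a zero vector and drop out, so each direction reduces to tanh(W_in x + b) and the concatenated output is a single nonlinear layer of width 2 × N_HIDDEN. With gated cells such as LSTM or GRU the recurrent terms should vanish in the same way when the initial states are zero, leaving a gated feed-forward transform, though that case is not checked here.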
