Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder

Yousef Azizi Movahed, Fatemeh Ziaeetabar

TL;DR

The paper tackles fine-grained anticipation of hand-object interactions by converting raw MANIAC videos into a structured, statistical–kinematic feature dataset of 27,476 samples, then evaluating static and temporal models for atomic state classification. Counterintuitively, turning a Bidirectional RNN into a seq_length=1 static encoder yields substantial gains, achieving 97.60% accuracy and a balanced F1 of 0.90 for the challenging 'grabbing' state. This work establishes a robust, interpretable benchmark for low-level HOI state recognition using engineered features and lightweight architectures, offering a practical baseline for future exploration with more complex models like GNNs. It also highlights the potential of reinterpreting recurrent cells as static encoders when feature representations are rich enough.

Abstract

Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis was that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counterintuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification changed the network's role, making it act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final accuracy of 97.60%. Of particular note, our optimized model handled the most challenging transitional class, 'grabbing', achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
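To make the seq_length=1 observation concrete, the sketch below shows why a bidirectional RNN fed a single time step behaves like a static encoder. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the cell type, so a vanilla tanh (Elman) cell with a zero initial hidden state is assumed, and all names (`rnn_step`, `static_encode`, the layer sizes) are hypothetical. With one step and a zero initial state, the recurrent term contributes nothing, so each direction reduces to `tanh(W_in x + b)` and the output is the concatenation of two feed-forward projections.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_in, W_rec, b, x, h_prev):
    """One vanilla RNN step: h = tanh(W_in x + W_rec h_prev + b)."""
    a = matvec(W_in, x)
    r = matvec(W_rec, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]

def run_direction(params, seq):
    """Run one direction over a sequence, starting from a zero hidden state."""
    W_in, W_rec, b = params
    h = [0.0] * len(b)
    for x in seq:
        h = rnn_step(W_in, W_rec, b, x, h)
    return h

def bidirectional_encode(fwd, bwd, seq):
    """Concatenate the final forward state and the final backward state."""
    return run_direction(fwd, seq) + run_direction(bwd, list(reversed(seq)))

def static_encode(fwd, bwd, x):
    """Feed-forward equivalent for a single input: no recurrent weights used."""
    out = []
    for W_in, _W_rec, b in (fwd, bwd):
        out += [math.tanh(a + bi) for a, bi in zip(matvec(W_in, x), b)]
    return out

def random_params(rng, n_in, n_hidden):
    """Hypothetical random weights, only for demonstration."""
    W_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W_rec = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_hidden)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return W_in, W_rec, b

rng = random.Random(0)
fwd = random_params(rng, n_in=4, n_hidden=3)
bwd = random_params(rng, n_in=4, n_hidden=3)

x = [0.5, -1.0, 0.25, 2.0]                       # one feature vector (toy values)
seq_out = bidirectional_encode(fwd, bwd, [x])    # sequence of length 1
static_out = static_encode(fwd, bwd, x)          # plain feed-forward path
```

Under these assumptions `seq_out` and `static_out` agree element by element, and the recurrent weights `W_rec` never influence the result. Gated cells such as LSTM or GRU would reduce to a different, gated feed-forward map, but the same argument about the vanishing recurrent term applies when the initial state is zero.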

Paper Structure

This paper contains 15 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the methodology. (a) The six-stage data-engineering pipeline converts raw MANIAC videos into a structured feature corpus. (b) The experimental evolution traces eight successive modeling stages, from static baselines to the final champion model.
  • Figure 2: The six-stage pipeline used to construct the statistical–kinematic dataset from MANIAC.
  • Figure 3: Predictive sliding-window mechanism: a history of 10 keyframes is summarized into one statistical feature vector which is used to predict the label at the 11th keyframe.
  • Figure 4: Evolution of modeling experiments: from static MLPs through temporal RNNs to the final Optuna-optimized champion.
  • Figure 5: Champion architecture: an Optuna-optimized bidirectional RNN configured with seq_length=1, serving as a high-capacity static encoder for individual feature vectors.
  • ...and 1 more figure
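The predictive sliding-window mechanism described for Figure 3 can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the caption states only that a history of 10 keyframes is summarized into one statistical feature vector used to predict the label at the 11th keyframe, so the particular statistics (mean, standard deviation, minimum, maximum, net change), the single toy feature, and the function names are assumptions made here for demonstration.

```python
import statistics

def summarize_window(frames):
    """Collapse a window of per-keyframe feature lists into one flat vector.

    Assumed statistics per feature: mean, population std, min, max, net change.
    """
    n_features = len(frames[0])
    vec = []
    for j in range(n_features):
        col = [frame[j] for frame in frames]
        vec += [
            statistics.mean(col),
            statistics.pstdev(col),
            min(col),
            max(col),
            col[-1] - col[0],   # net change across the window
        ]
    return vec

def make_samples(keyframes, labels, history=10):
    """Pair each window of `history` keyframes with the label of the next keyframe."""
    samples = []
    for t in range(history, len(keyframes)):
        window = keyframes[t - history:t]
        samples.append((summarize_window(window), labels[t]))
    return samples

# Toy data: one hypothetical feature (hand-object distance) shrinking over 12 keyframes.
keyframes = [[float(12 - i)] for i in range(12)]
labels = ["approaching"] * 10 + ["grabbing", "holding"]

samples = make_samples(keyframes, labels, history=10)
first_vector, first_label = samples[0]
```

With 12 keyframes and a history of 10, this yields two samples: the first summarizes keyframes 1 to 10 and carries the label of keyframe 11, and the second shifts the window by one keyframe. Each resulting vector is a single static input, which is the form a seq_length=1 encoder consumes.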