Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

Xunsong Li, Pengzhan Sun, Yangcen Liu, Lixin Duan, Wen Li

TL;DR

Object-centric action recognition traditionally relies on a two-stage pipeline with separate object detection and action reasoning. The paper introduces DAIR, an end-to-end framework that performs Detection And Interaction Reasoning simultaneously through three modules built atop a shared video transformer backbone: PatchDec for patch-based object detection, IRA for interactive object refining and aggregation, and ORM for object relation modeling. DAIR demonstrates that interactive objects are crucial for accurate action understanding and achieves state-of-the-art results on Something-Else and IKEA-Assembly, outperforming detector-dependent baselines. The approach reduces training complexity, improves generalization to unseen objects, and provides a unified, efficient pipeline for video understanding in object-centric scenarios.

Abstract

The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector and are then fed to an action recognition model that extracts video features and learns the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Specifically, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, thus avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we outperform state-of-the-art methods on both datasets.
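The score-refinement and aggregation step described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the learned IRA module is replaced here by a hand-written heuristic, and every name (`Proposal`, `interactiveness`, `refine_and_aggregate`) and the specific proximity and similarity formulas are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """Hypothetical container for one detected box (not from the paper)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
    det_score: float                        # raw confidence from the object decoder
    feature: List[float]                    # appearance embedding


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def interactiveness(subject: Proposal, obj: Proposal) -> float:
    """Toy stand-in for a learned score: position proximity times appearance similarity."""
    sx, sy = box_center(subject.box)
    ox, oy = box_center(obj.box)
    proximity = math.exp(-math.hypot(sx - ox, sy - oy))  # 1.0 when centers coincide
    dot = sum(a * b for a, b in zip(subject.feature, obj.feature))
    norm = (math.sqrt(sum(a * a for a in subject.feature))
            * math.sqrt(sum(b * b for b in obj.feature)))
    similarity = 0.5 * (1.0 + dot / norm) if norm > 0 else 0.5  # cosine mapped to [0, 1]
    return proximity * similarity


def refine_and_aggregate(subject: Proposal, proposals: List[Proposal]):
    """Rescale detection scores by interactiveness, then pool object features."""
    refined = [p.det_score * interactiveness(subject, p) for p in proposals]
    total = sum(refined)
    n = len(proposals)
    weights = [r / total for r in refined] if total > 0 else [1.0 / n] * n
    dim = len(subject.feature)
    pooled = [sum(w * p.feature[i] for w, p in zip(weights, proposals))
              for i in range(dim)]
    return refined, pooled
```

In this toy setting, a nearby object with a lower raw detection score can end up ranked above a distant, confidently detected one, which is the intuition behind refining proposal scores by interaction rather than by detection confidence alone.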

Paper Structure (16 sections, 11 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Comparison between the previous two-stage pipeline (left figure) and our proposed DAIR (right figure). Previous works limited their research scope to only video feature extraction and object relation modeling, which relied on the output of an off-the-shelf object detector in the first stage. Our proposed DAIR incorporates object detection into the end-to-end model architecture, not only reducing multi-stage training cost but also allowing the object detector to focus on interactive objects in the action.
  • Figure 2: Illustration of interactive objects (blue boxes), non-interactive objects (yellow dashed boxes), and subjects (red boxes) in the action. Non-interactive objects can act as noise for action classification.
  • Figure 3: Overview of the proposed DAIR. It consists of three main components and takes densely sampled frames as input. PatchDec (Patch Decoder, can be instantiated by any video transformer) is implemented as a transformer decoder and is responsible for decoding objects from learnable queries after they attend to patch tokens. IRA (Interactive Object Refining and Aggregation) takes the instance-level representations of subjects and object proposals as input and performs inter-relation reasoning to refine the confidence scores of objects. In ORM (Object Relation Modeling), subject and object features are combined, and we take the [CLS] token as the final video representation to predict actions.
  • Figure 4: Visualization of the attention map comparison and the interactiveness scores predicted by IRA. We compare the attention regions of the [CLS] token between DAIR and MViT (second and third rows). We also present the interactiveness scores predicted by IRA (first row). The ground-truth objects (i.e., interactive objects) are drawn as colored boxes, and their corresponding interactiveness scores are displayed above the images.
  • Figure 5: (a) Action classification accuracy (Top-1 Classification Acc and Top-5 Classification Acc) and detection mean average precision (Box Detection mAP) during validation, where the x-axis denotes the training epoch. The classification accuracy and detection mAP follow the same trend. (b) Visualization of each training objective during training. We plot the training curves of $\mathcal{L}_{Ir}$ and $\mathcal{L}_{Act}$, as well as $\mathcal{L}_{c}$, $\mathcal{L}_{b}$, and $\mathcal{L}_{u}$ in $\mathcal{L}_{Det}$. The curves indicate that all these training objectives can optimize classification and detection simultaneously.
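The caption of Figure 5(b) names three objectives, $\mathcal{L}_{Act}$, $\mathcal{L}_{Ir}$, and $\mathcal{L}_{Det}$, with $\mathcal{L}_{Det}$ made up of $\mathcal{L}_{c}$, $\mathcal{L}_{b}$, and $\mathcal{L}_{u}$. One plausible way to combine them for joint training is a weighted sum, sketched below. The weighted-sum form and the weights $\lambda_{Ir}$, $\lambda_{Det}$, $\lambda_{c}$, $\lambda_{b}$, and $\lambda_{u}$ are assumptions for illustration; this section gives neither the paper's actual combination nor the exact definition of each term.

```latex
\mathcal{L} = \mathcal{L}_{Act} + \lambda_{Ir}\,\mathcal{L}_{Ir} + \lambda_{Det}\,\mathcal{L}_{Det},
\qquad
\mathcal{L}_{Det} = \lambda_{c}\,\mathcal{L}_{c} + \lambda_{b}\,\mathcal{L}_{b} + \lambda_{u}\,\mathcal{L}_{u}
```

Under such a formulation, gradients from the action loss and the detection losses flow through the same backbone, which is consistent with the caption's observation that classification accuracy and detection mAP improve together.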