Modelling Spatio-Temporal Interactions For Compositional Action Recognition

Ramanathan Rajendiran, Debaditya Roy, Basura Fernando

TL;DR

This work proposes an interaction model that captures both fine-grained and long-range interactions between hands and objects. The resulting interaction tokens are then infused with global motion information from video tokens, which provides additional contextual cues for differentiating similar actions.

Abstract

Humans have a natural ability to recognize actions even when the objects involved or the background change. They can abstract the action away from the appearance of the objects, a property referred to as the compositionality of actions. We focus on this compositional aspect of action recognition to impart human-like generalization abilities to video action-recognition models. First, we propose an interaction model that captures both fine-grained and long-range interactions between hands and objects. Frame-wise hand-object interactions capture fine-grained movements, while long-range interactions capture broader context and disambiguate actions across time. Second, to provide additional contextual cues for differentiating similar actions, we infuse the interaction tokens with global motion information from video tokens. The final global-motion-refined interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset, where we obtain a new state-of-the-art result, outperforming recent object-centric methods by a significant margin.
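
To make the distinction between frame-wise and long-range interactions concrete, here is a toy sketch in Python. It is not the paper's learned encoders: it uses hand-crafted features on box coordinates only, and every name in it (`box_center`, `frame_interactions`, `trajectory_interaction`, the `(x1, y1, x2, y2)` box format) is a hypothetical choice made for illustration. The per-frame feature is the hand-to-object offset, and the clip-level feature is the net displacement of each entity over the whole clip.

```python
def box_center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2). Box format is an assumption."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def frame_interactions(hand_boxes, object_boxes):
    """Per-frame hand-to-object offset: a hand-crafted stand-in for fine-grained cues."""
    feats = []
    for h, o in zip(hand_boxes, object_boxes):
        (hx, hy), (ox, oy) = box_center(h), box_center(o)
        feats.append((ox - hx, oy - hy))
    return feats


def trajectory_interaction(hand_boxes, object_boxes):
    """Clip-level cue: net displacement of the hand and of the object, first to last frame."""
    def displacement(boxes):
        (x0, y0), (x1, y1) = box_center(boxes[0]), box_center(boxes[-1])
        return (x1 - x0, y1 - y0)
    # Concatenate the two displacement vectors into one 4-tuple.
    return displacement(hand_boxes) + displacement(object_boxes)


# Toy clip of three frames: the hand stays still while the object moves right.
hand = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
obj = [(2, 0, 4, 2), (4, 0, 6, 2), (6, 0, 8, 2)]

per_frame = frame_interactions(hand, obj)      # one offset per frame
clip_level = trajectory_interaction(hand, obj)  # one vector per clip
```

Neither feature depends on what the object looks like, which is the intuition behind separating the action from object appearance. The model described in the abstract additionally uses appearance features and learns both kinds of interaction rather than computing them by hand.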

Paper Structure (17 sections, 13 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: In order to recognize the compositional action letting something roll down a slanted surface, one needs to pay attention to the local interactions involving the spatial positions and appearance features of the hand and the active object(s) in the scene. The local interaction tokens (yellow blocks) encode the changes in the coordinates and appearance features extracted from the hand (red) and the active object (blue). Furthermore, for the action rolling down, the context of the slanted surface of the car differentiates the motion from other similar actions such as moving something down or lifting up one end of something, then letting it drop down. Hence the local interaction tokens need to be enriched with the global motion information from the video tokens (green blocks).
  • Figure 2: We first extract the hand and object features (coordinate and appearance) from 2D frames. The Frame Interaction Encoder encodes fine-grained hand-object interactions, while the Trajectory Interaction Encoder encodes long-range interactions between hand and object trajectories. The outputs of these two encoders are fused to form Spatio-Temporal Interaction tokens ($\textit{STI tokens}$). $\textit{STI tokens}$ are then refined with global motion information from video tokens. We use the global-motion-infused $\textit{STI tokens}$ for compositional action recognition.
  • Figure 3: Block diagram of the Fused Frame-Trajectory Interaction Encoder. Fine-grained interactions between the hand and the object features in every frame are modelled via the Frame Interaction Encoder. Long-range interactions between the hand and the object feature trajectories across the video are modelled using the Trajectory Interaction Encoder. The outputs of the two parallel encoders are fused using an MLP to obtain the $\textit{STI tokens}$.
  • Figure 4: $\textit{STI tokens}$ are infused with global motion via the Global Motion Infusion Transformer. By matching the STI token queries with the video token keys, the spatial attention operation first computes the best location for the STI token trajectories. Next, the temporal attention operation pools the interaction trajectories across time to accumulate the temporal information in $\textit{STI tokens}$. The global-motion-infused $\textit{STI tokens}$ are then used for compositional action recognition. $\mathcal{S}$ denotes the softmax function, $\langle\cdot,\cdot\rangle$ denotes the inner-product operator, and $X$ denotes the weighted-sum operator.
  • Figure 5: Visualization of the attention of the final class token output on all the spatial tokens. In actions involving {(3) Moving, (5) Pushing, (7) Twisting} something, our method focuses on the interaction regions. In {(1) Dropping, (2) Spilling, (4) Pouring} something into something actions, our method pays high attention to both of the objects involved in the interaction. In (6) Tearing something into two pieces, our method attends to both pieces after the object is torn.
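
The two attention steps described in the Figure 4 caption (softmax over inner products, followed by a weighted sum) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's Global Motion Infusion Transformer: it uses a single head, no learned query/key/value projections, no scaling, one STI token per frame, and the function names (`attend`, `infuse_global_motion`) and the token layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (the S operator)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors (the <,> operator)."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, keys, values):
    """Attention for one query: softmax of query-key inner products, then a weighted sum of values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


def infuse_global_motion(sti_tokens, video_tokens):
    """Assumed layout: sti_tokens is T x D (one token per frame),
    video_tokens is T x N x D (N spatial video tokens per frame).

    Step 1 (spatial): each frame's STI token queries that frame's video tokens.
    Step 2 (temporal): each spatially refined token attends over all frames.
    """
    spatial = [attend(q, frame, frame)[0] for q, frame in zip(sti_tokens, video_tokens)]
    return [attend(q, spatial, spatial)[0] for q in spatial]
```

For example, calling `infuse_global_motion(sti, video)` with `T` STI tokens of dimension `D` returns `T` refined tokens of the same dimension, each a convex combination of video-token values. A real implementation would add learned projections and multiple heads, and would typically run on batched tensors.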