Leveraging Foundation Models for Multimodal Graph-Based Action Recognition

Fatemeh Ziaeetabar, Florentin Wörgötter

TL;DR

This work introduces a graph-based framework that integrates vision-language foundation models, using VideoMAE for dynamic visual encoding and BERT for contextual text embedding, to recognize fine-grained bimanual manipulation actions.

Abstract

Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates vision-language foundation models, using VideoMAE for dynamic visual encoding and BERT for contextual text embedding, to recognize fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph in which nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
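The idea of attention that modulates edge importance can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the dot-product scoring stands in for the learned projection and nonlinearity of a real Graph Attention Network layer, and the per-edge-type bias `type_bias`, along with the names `gat_update` and `attention_weights`, are hypothetical choices made only for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_weights(scores):
    """Numerically stable softmax over a dict of neighbor -> raw score."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def gat_update(node, features, neighbors, edge_type, type_bias):
    """One attention-weighted aggregation step for `node`.

    Assumed scoring rule (illustrative only):
        score(i, j) = <h_i, h_j> + bias[type of edge (i, j)]
    """
    scores = {
        j: dot(features[node], features[j]) + type_bias[edge_type[(node, j)]]
        for j in neighbors[node]
    }
    alpha = attention_weights(scores)
    dim = len(features[node])
    updated = [
        sum(alpha[j] * features[j][d] for j in neighbors[node])
        for d in range(dim)
    ]
    return updated, alpha


# Toy graph: one frame node attending to an object node and a text node.
features = {"frame": [1.0, 0.0], "object": [0.0, 1.0], "text": [0.0, 1.0]}
neighbors = {"frame": ["object", "text"]}
edge_type = {("frame", "object"): "spatial", ("frame", "text"): "semantic"}
type_bias = {"spatial": 0.0, "semantic": 1.0}  # hypothetical task-specific bias

updated, alpha = gat_update("frame", features, neighbors, edge_type, type_bias)
print(alpha)
```

Because the two neighbors have identical features here, the only thing separating their attention weights is the edge-type bias, so the semantic edge receives the larger share. In a trained model that bias would be learned rather than fixed by hand.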

Paper Structure

This paper contains 51 sections, 16 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of the proposed multimodal graph-based framework. The framework consists of three main stages: (1) multimodal feature extraction using VideoMAE for video and BERT for text, (2) multimodal dynamic graph construction where nodes (frames, objects, and text) and edges (temporal, spatial, semantic) are defined, and (3) graph-based multimodal reasoning using Graph Attention Networks (GAT) to refine representations for action classification.
  • Figure 2: Overview of the feature extraction and temporal aggregation process. The input video sequence, represented as frames $\{I_1, I_2, \dots, I_T\}$, is processed by VideoMAE to extract frame embeddings $\mathbf{f}_v^t$, which are represented as frame nodes. These nodes are connected in a temporal graph, where a self-attention mechanism captures long-range dependencies across time prior to graph-based reasoning. The resulting temporally aggregated embedding summarizes the video sequence for downstream tasks such as action recognition.
  • Figure 3: Workflow of text feature extraction using BERT. The input text (e.g., "grasping an object") is tokenized into subwords such as [CLS], grasping, an, object, and [SEP]. These tokens are passed through the BERT model to generate contextual embeddings for each token. The [CLS] token embedding is aggregated to produce the final text feature vector ($\mathbf{f}_t$), which is integrated into the multimodal graph as a text node. This node is connected to frame and object nodes through edges encoding semantic and temporal relationships.
  • Figure 4: Illustration of the multimodal graph construction process for the action "Pouring water into a glass." The top row shows three frames ($I_1, I_2, I_3$) from a video depicting the pouring action. Frame nodes ($I_1, I_2, I_3$) represent the temporal context of the video, capturing the sequence of events. Object nodes ($H$, $G$, $B$) correspond to the hand, glass, and bottle detected in each frame, enabling spatial reasoning. The text node ($T$) represents the semantic annotation, providing contextual information about the action. Edges in the graph encode different types of relationships: temporal edges (red dashed lines) connect consecutive frames, spatial edges (green solid and black dotted lines) capture frame-to-object and object-to-object dependencies, and semantic edges (blue dotted lines) link the text node to both frame and object nodes. This multimodal graph structure integrates temporal, spatial, and semantic information for robust action recognition.
  • Figure 5: Task-Specific Attention and Dynamic Graph Adaptation in Bimanual Manipulation (Cutting Bread). This figure shows evolving hand-object and object-object interactions across six frames. Nodes represent the right hand (RH), left hand (LH), knife (K), and bread (B). Edges adapt in strength based on contextual relevance as the cutting action progresses.
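The graph layout described in the Figure 4 caption can be sketched as a small data structure. This is a hedged illustration, not the paper's code: it assumes one set of object nodes per frame (the caption says the objects are detected in each frame, but does not state whether nodes are shared across frames), and the function name `build_multimodal_graph` and the `name@frame` node identifiers are invented for the example.

```python
from itertools import combinations


def build_multimodal_graph(num_frames, objects, text_label="T"):
    """Build typed nodes and edges for a frame/object/text graph.

    Returns:
        nodes: dict mapping node id -> node kind ("frame", "object", "text")
        edges: list of (source, target, edge kind) tuples
    """
    nodes = {}
    edges = []

    frame_ids = [f"I{t}" for t in range(1, num_frames + 1)]
    for fid in frame_ids:
        nodes[fid] = "frame"

    # Temporal edges connect consecutive frames.
    for a, b in zip(frame_ids, frame_ids[1:]):
        edges.append((a, b, "temporal"))

    # Spatial edges: frame-to-object and object-to-object within a frame.
    # Assumption: each frame gets its own copy of every object node.
    for t, fid in enumerate(frame_ids, start=1):
        obj_ids = [f"{name}@{t}" for name in objects]
        for oid in obj_ids:
            nodes[oid] = "object"
            edges.append((fid, oid, "spatial"))
        for a, b in combinations(obj_ids, 2):
            edges.append((a, b, "spatial"))

    # Semantic edges link the text node to every frame and object node.
    targets = list(nodes)
    nodes[text_label] = "text"
    for nid in targets:
        edges.append((text_label, nid, "semantic"))

    return nodes, edges


# "Pouring water into a glass": three frames; hand, glass, and bottle.
nodes, edges = build_multimodal_graph(3, ["H", "G", "B"])
edge_counts = {}
for _, _, kind in edges:
    edge_counts[kind] = edge_counts.get(kind, 0) + 1
print(len(nodes), edge_counts)
```

Keeping the edge kind on every tuple makes it straightforward for a downstream attention layer to treat temporal, spatial, and semantic relationships differently, which is the role the Figure 5 caption assigns to the adaptive edge strengths.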