Multimodal Large Models Are Effective Action Anticipators

Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang

TL;DR

This work tackles long-term action anticipation by leveraging Large Language Models (LLMs) to model extended temporal dynamics and action semantics. It introduces ActionLLM, a multimodal framework that treats video sequences as tokens and fuses visual and textual information through a Cross-Modality Interaction Block (CMIB), aided by an action-tuning module and a linear decoder to predict future actions. Key contributions include the CMIB for robust vision-text interaction, a feature adapter and action-tuning strategy for efficient LLM adaptation, and training objectives that combine past visual/textual cues with future action predictions. Empirical results on Breakfast and 50 Salads demonstrate state-of-the-art performance, highlighting the practical potential of integrating LLMs into multimodal, long-horizon action forecasting. The work points to a scalable direction for multimodal large models in sequential prediction tasks and motivates further exploration across diverse LLMs and efficiency-focused architectures.

Abstract

The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at https://github.com/2tianyao1/ActionLLM.git.
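
The baseline described in the abstract (observed frames as tokens, appended future tokens, and a linear layer in place of the textual decoder) can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: `causal_mean_backbone` is a hypothetical stand-in for the LLM (it only mimics causal, left-to-right context), and the names `linear_head` and `anticipate`, the feature sizes, and the weights are all invented for illustration.

```python
from typing import List

Vector = List[float]


def causal_mean_backbone(tokens: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for the LLM: position t only sees tokens 0..t.

    Each hidden state is the mean of the prefix, which is enough to show
    that future tokens are conditioned on the observed ones.
    """
    hidden: List[Vector] = []
    running = [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        running = [r + x for r, x in zip(running, tok)]
        hidden.append([r / t for r in running])
    return hidden


def linear_head(h: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear decoder: one logit per action class, no text generation."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def anticipate(observed: List[Vector], future_queries: List[Vector],
               weight: List[Vector], bias: Vector) -> List[int]:
    """Return one predicted class index per future token."""
    # Observed frame tokens come first, future (query) tokens are appended.
    tokens = observed + future_queries
    hidden = causal_mean_backbone(tokens)
    # Only the positions of the future tokens are decoded.
    future_hidden = hidden[len(observed):]
    preds = []
    for h in future_hidden:
        logits = linear_head(h, weight, bias)
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds


# Toy usage: two observed frame tokens, two future tokens, two action classes.
observed = [[1.0, 0.0], [1.0, 0.0]]
future_queries = [[0.0, 0.0], [0.0, 1.0]]
example_preds = anticipate(observed, future_queries,
                           weight=[[1.0, 0.0], [0.0, 5.0]], bias=[0.0, 0.0])
```

The point of the sketch is the data flow: predictions are read directly from the future-token positions as class logits, so no instruction prompt or generated description is needed.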

Paper Structure

This paper contains 26 sections, 19 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: An application of ActionLLM for predicting long-term actions in kitchen scenarios. We leverage the LLM to explore the interdependencies among these actions. By integrating text labels with visual features, synchronized with frames, we achieve information fusion through a Cross-Modality Interaction Block. The text aligns with the LLM's input format, enhancing the utilization of its intrinsic commonsense for precise prediction of future action categories and their durations (D).
  • Figure 2: Network architecture of ActionLLM. (a) Feature Acquisition and Adaptation. The raw action label is tokenized to generate text tokens, which are then processed through the token embedding layer of the frozen LLM to extract text features. The visual I3D features and the preset query are passed through a feature adapter layer to align with the text features. (b) Cross-Modality Interaction Attention (CMIA). CMIA employs self-attention and cross-attention mechanisms to capture the distinct characteristics of each modality and the relationships between them. Arrows of various shapes and colors differentiate the flow of data: blue for textual features, green for visual features, and yellow for query processing. Dashed arrows highlight the processing of V in the attention mechanism to clarify the CMIA module's internal structure. (c) Cross-Modality Interaction Block (CMIB). CMIB outputs textual, visual, and action query features following modal fusion. (d) LLM Adaptation. The action tuning module fine-tunes the LLM for long-term action anticipation. The multimodal down/up projection layer ensures compatibility with the input specifications of both the CMIB and the LLM. The outputs from the transformer layers of the LLM are processed through their respective past classifier and future predictor to produce future actions.
  • Figure 3: Adaptation modules. (a) Feature Adapter. This module is responsible for harmonizing the dimensions and representations between visual and query features. (b) Action Tuning. This module focuses on optimizing and fine-tuning the architecture of LLMs to enhance their performance in specific downstream tasks.
  • Figure 4: We increase the number of action queries until performance degrades. For each query count, we use four prediction ratios (10%, 20%, 30%, and 50%), keeping the observation ratio fixed at 30%. Different colors indicate MoC values at each prediction ratio, and the line shows the average MoC change across queries.
  • Figure 5: Qualitative analysis. We predict the future actions of two video examples, labeled (a) and (b), from the 50 Salads dataset using three distinct models: FUTR, LLMAction, and ActionLLM. These predictions are made under experimental settings with $\alpha = 0.2$ and $\beta = 0.3$. The transitions between past and future actions are indicated by dashed lines, with each color in each video representing a unique action.
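
The captions above refer to observation and prediction ratios, predicted action durations, and MoC values. The sketch below shows one way these pieces might fit together, assuming MoC denotes mean-over-classes frame accuracy as commonly used on Breakfast and 50 Salads. The function names, the truncating split, and the normalization of durations are assumptions made for illustration and are not taken from the paper's code.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def split_video(labels: Sequence[str], obs_ratio: float,
                pred_ratio: float) -> Tuple[List[str], List[str]]:
    """Split frame labels into an observed part and the future part to predict.

    Both ratios are fractions of the whole video length (assumed convention).
    """
    n = len(labels)
    obs_end = int(obs_ratio * n)
    pred_end = min(n, obs_end + int(pred_ratio * n))
    return list(labels[:obs_end]), list(labels[obs_end:pred_end])


def expand_segments(segments: Sequence[Tuple[str, float]],
                    num_frames: int) -> List[str]:
    """Turn (action, relative duration) predictions into frame-wise labels."""
    total = sum(d for _, d in segments)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    frames: List[str] = []
    cum = 0.0
    for i, (action, dur) in enumerate(segments):
        cum += dur
        # The last segment absorbs any rounding remainder.
        last = i == len(segments) - 1
        end = num_frames if last else int(cum * num_frames / total)
        frames.extend([action] * (end - len(frames)))
    return frames


def mean_over_classes(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Average per-class frame accuracy over classes present in the ground truth."""
    hits, counts = defaultdict(int), defaultdict(int)
    for p, t in zip(pred, truth):
        counts[t] += 1
        hits[t] += int(p == t)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


# Toy usage: a 10-frame video, observe 30%, predict the next 50%.
labels = ["cut"] * 4 + ["mix"] * 4 + ["serve"] * 2
observed, future = split_video(labels, obs_ratio=0.3, pred_ratio=0.5)
predicted = expand_segments([("cut", 3.0), ("mix", 2.0)], num_frames=len(future))
moc = mean_over_classes(predicted, future)
```

Averaging per class rather than per frame keeps long, frequent actions from dominating the score, which matters when a few actions occupy most of a video.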