QueryMamba: A Mamba-Based Encoder-Decoder Architecture with a Statistical Verb-Noun Interaction Module for Video Action Forecasting @ Ego4D Long-Term Action Anticipation Challenge 2024

Zeyun Zhong, Manuel Martin, Frederik Diederichs, Juergen Beyerer

TL;DR

The paper tackles long-term action forecasting in egocentric video by predicting sequences of verb-noun actions. It introduces QueryMamba, a Mamba-based encoder-decoder that processes long-range visual context and uses a query-based decoder to anticipate future actions. A verb-noun interaction module exploits a dataset-specific co-occurrence prior to sample verbs and nouns jointly. The method uses an action taxonomy of 7328 verb-noun pairs and VideoMAE-based visual features, with memory spans of 64 seconds for long-term context and 30 seconds for short-term context. On Ego4D LTA, the approach placed second in the challenge and achieved the best noun prediction edit distance, which supports the value of modeling verb-noun co-occurrence for sequence forecasting. The work also points to potential gains from combining co-occurrence priors with zero-shot language models to improve generalization and commonsense reasoning in action forecasting.

Abstract

This report presents a novel Mamba-based encoder-decoder architecture, QueryMamba, featuring an integrated verb-noun interaction module that utilizes a statistical verb-noun co-occurrence matrix to enhance video action forecasting. This architecture not only predicts verbs and nouns likely to occur based on historical data but also considers their joint occurrence to improve forecast accuracy. The efficacy of this approach is substantiated by experimental results, with the method achieving second place in the Ego4D LTA challenge and ranking first in noun prediction accuracy.
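
The idea of letting a co-occurrence matrix shape joint verb-noun sampling can be illustrated with a small sketch. This is not the authors' implementation: the multiplicative reweighting, the smoothing constant `eps`, and the function names `cooccurrence_prior`, `joint_distribution`, and `sample_pairs` are all assumptions made for illustration, since this summary does not state the exact formula.

```python
import random
from collections import Counter


def cooccurrence_prior(pairs, num_verbs, num_nouns, eps=1e-6):
    """Normalized (verb, noun) co-occurrence counts from training pairs.

    eps is a hypothetical smoothing constant so unseen pairs keep a
    tiny nonzero probability.
    """
    counts = Counter(pairs)
    total = len(pairs) + eps * num_verbs * num_nouns
    return [[(counts[(v, n)] + eps) / total for n in range(num_nouns)]
            for v in range(num_verbs)]


def joint_distribution(p_verb, p_noun, prior):
    """Reweight the independent product p(verb) * p(noun) by the prior.

    Assumed combination rule: elementwise product, then renormalize.
    """
    joint = [[pv * pn * prior[v][n] for n, pn in enumerate(p_noun)]
             for v, pv in enumerate(p_verb)]
    z = sum(sum(row) for row in joint)
    return [[x / z for x in row] for row in joint]


def sample_pairs(joint, k, seed=0):
    """Draw k (verb, noun) index pairs from the joint distribution."""
    rng = random.Random(seed)
    cells = [(v, n) for v in range(len(joint)) for n in range(len(joint[0]))]
    weights = [joint[v][n] for v, n in cells]
    return rng.choices(cells, weights=weights, k=k)


# Toy data: verb 0 only ever co-occurs with noun 0, verb 1 with noun 1.
train_pairs = [(0, 0)] * 3 + [(1, 1)] * 3
prior = cooccurrence_prior(train_pairs, num_verbs=2, num_nouns=2)

# Independent argmax would pick verb 0 with noun 1, a pair never seen in training.
joint = joint_distribution([0.6, 0.4], [0.45, 0.55], prior)
samples = sample_pairs(joint, k=5)
```

In this toy setting the prior pushes probability mass away from the unseen pair (verb 0, noun 1) toward pairs that actually occur in the training data, which is the intuition behind considering verbs and nouns jointly rather than independently.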

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Schematic illustration of the proposed action forecasting architecture, featuring a dual-stage process with feature extraction via a video backbone and future anticipation using an encoder-decoder structure. The verb-noun interaction module leverages co-occurrence matrices to enhance predictive accuracy during inference by integrating statistical relationships between actions (verbs) and objects (nouns).
  • Figure 2: (a) The proposed model, QueryMamba, integrates long-term and short-term memories processed through the Mamba encoder. The future actions are anticipated by the decoder, utilizing static content embeddings $Q$ and learnable positional embeddings $Q_{pos}$, under the guidance of encoded past short-term memories $E_S$. (b) A Mamba block consists of multiple layers including linear layers, 1D convolutional layers, and SiLU (Sigmoid Linear Unit) activation functions, with the core being the SSM module.
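
The components named in Figure 2(b) (linear layers, a 1D convolution, SiLU activations, and an SSM core) can be sketched for a single scalar channel as follows. This is a toy under stated assumptions, not the paper's block: the gating branch follows the commonly described Mamba layout rather than anything stated in the caption, the SSM parameters `a`, `b`, `c`, and `dt` are fixed here whereas Mamba makes them input-dependent, and every name and default value is hypothetical.

```python
import math


def silu(x):
    """SiLU (Sigmoid Linear Unit): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def causal_conv1d(xs, kernel):
    """Causal 1D convolution: the output at step t sees only inputs up to t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(xs))]


def ssm_scan(xs, a, b, c, dt):
    """Discretized state-space recurrence with a single scalar state.

    h_t = exp(dt * a) * h_{t-1} + dt * b * x_t,   y_t = c * h_t
    """
    a_bar = math.exp(dt * a)
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + dt * b * x
        ys.append(c * h)
    return ys


def toy_mamba_block(xs, w_in=1.0, w_gate=1.0, w_out=1.0,
                    kernel=(0.5, 0.5), a=-1.0, b=1.0, c=1.0, dt=0.1):
    """Scalar stand-in for: linear -> conv1d -> SiLU -> SSM, gated, then linear."""
    u = [w_in * x for x in xs]        # input projection, main branch
    z = [w_gate * x for x in xs]      # input projection, gate branch (assumed)
    u = [silu(v) for v in causal_conv1d(u, kernel)]
    y = ssm_scan(u, a, b, c, dt)
    gated = [yi * silu(zi) for yi, zi in zip(y, z)]
    return [w_out * g for g in gated]  # output projection


outputs = toy_mamba_block([1.0, 2.0, 3.0, 4.0])
```

Because both the convolution and the recurrence are causal, each output depends only on current and past inputs, and the recurrence runs in time linear in the sequence length, which is what makes this family of blocks attractive for long video contexts.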