Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang

TL;DR

This work tackles short-term object interaction anticipation in egocentric video by incorporating language-derived action context. It introduces TransFusion, a multimodal transformer that fuses language summaries of past actions with the current frame to predict next-active objects, their verbs/nouns, and time-to-contact. The approach leverages captioning models, hand-object detectors, and salient-object cues to build compact action-context sequences, which are then embedded via an SBERT language encoder and fused with visual features in a transformer-based fusion module. Across Ego4D and EPIC-KITCHENS-100, TransFusion yields substantial improvements over state-of-the-art methods, especially on long-tail classes, and demonstrates that language-based context can surpass purely visual cues with similar computational budgets. The results highlight the generalization power of language-informed context for video reasoning and point to future extensions incorporating motion cues and longer-horizon scenarios.

Abstract

We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using language-based context summaries in a task where vision seems to suffice. Our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at https://eth-ait.github.io/transfusion-proj/.

Paper Structure (46 sections, 8 equations, 20 figures, 17 tables)

Figures (20)

  • Figure 1: TransFusion: Multimodal fusion transformer for short-term object interaction anticipation. Given a video sequence of past observations, the object interaction anticipation task aims to predict a set of objects visible in the current frame that will be interacted with in the future, i.e. in the activation frame that is $\Delta$ frames away from the current prediction frame. Additionally, the task requires estimating the bounding box, the associated action described by a verb-noun pair, and the time to contact for each predicted object. We propose TransFusion, a multimodal fusion architecture that uses language summaries of past actions to effectively predict future object interactions.
  • Figure 2: Overview of the TransFusion model. TransFusion takes the prediction frame and the action context summary as input and predicts the bounding box of the next-active object, the noun-verb pair describing the associated action, and the time to contact (TTC). Feature maps of different scales are extracted from the visual encoder and then fused via a multimodal fusion module with the encoded language features. Their output is then processed by a regular feature pyramid network (FPN), denoted as multi-scale region proposal networks, before being fed into the Faster R-CNN detector.
  • Figure 3: Multimodal fusion module. We first project the CNN feature map and the language tokens to a common dimensionality before adding the specific embeddings. We concatenate the visual and language tokens and feed them to $M$ self-attention layers. At the output, the fused visual tokens are projected back into the initial feature map shape.
  • Figure 4: Prediction of Noun-Verb in relation to time-to-contact. Histogram of the time-to-contact labels is shown in blue bars. Performance measured in Noun-Verb is plotted as a function of time-to-contact.
  • Figure 5: Classification performance on top/tail categories. We show the relative and absolute gains of TransFusion over the Ego4D baseline for Noun-Only and Verb-Only mAP (without the IoU constraint). Relative improvements are written on top of the red bars.
  • ...and 15 more figures
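
The fusion steps named in the Figure 3 caption (project both modalities to a common dimensionality, add embeddings, concatenate, apply $M$ self-attention layers, project the visual tokens back) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, parameter names, and weight initialisation are hypothetical, and several choices are assumptions made for brevity: single-head attention with a residual connection, no layer normalisation or feed-forward sublayers, a positional plus modality embedding for visual tokens, and a modality embedding only for language tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention with a residual connection (simplification).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return tokens + softmax(scores) @ v

def init_params(c, d_lang, d, hw, m, rng):
    # Hypothetical parameter layout; random values stand in for learned weights.
    def mat(rows, cols):
        return rng.normal(scale=1.0 / np.sqrt(rows), size=(rows, cols))
    return {
        "w_vis": mat(c, d),            # visual projection to common dim d
        "w_lang": mat(d_lang, d),      # language projection to common dim d
        "pos_emb": rng.normal(size=(hw, d)),  # assumed positional embedding
        "vis_type": rng.normal(size=d),       # assumed modality embedding
        "lang_type": rng.normal(size=d),      # assumed modality embedding
        "layers": [(mat(d, d), mat(d, d), mat(d, d)) for _ in range(m)],
        "w_out": mat(d, c),            # projection back to C channels
    }

def fuse(feature_map, lang_tokens, params):
    # feature_map: (C, H, W) CNN features; lang_tokens: (L, d_lang).
    c, h, w = feature_map.shape
    vis = feature_map.reshape(c, h * w).T                      # (H*W, C)
    vis = vis @ params["w_vis"] + params["pos_emb"] + params["vis_type"]
    lang = lang_tokens @ params["w_lang"] + params["lang_type"]
    tokens = np.concatenate([vis, lang], axis=0)               # (H*W + L, d)
    for w_q, w_k, w_v in params["layers"]:                     # M layers
        tokens = self_attention(tokens, w_q, w_k, w_v)
    fused_vis = tokens[: h * w] @ params["w_out"]              # keep visual tokens
    return fused_vis.T.reshape(c, h, w)                        # back to (C, H, W)

rng = np.random.default_rng(0)
params = init_params(c=8, d_lang=6, d=16, hw=12, m=2, rng=rng)
fmap = rng.normal(size=(8, 3, 4))
lang = rng.normal(size=(5, 6))
out = fuse(fmap, lang, params)
print(out.shape)
```

Because the output keeps the shape of the input feature map, a module of this kind can be applied per scale before a feature pyramid network, which matches the pipeline outlined in the Figure 2 caption.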