Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Zitian Tang, Rohan Myer Krishnan, Zhiqiu Yu, Chen Sun

TL;DR

Spacewalk-18 tackles the challenge of procedural video understanding in unseen domains by introducing a multimodal, long-form benchmark based on spacewalk footage. It defines two tasks, step recognition and video question answering, and provides 96 hours of densely annotated spacewalks with 455 animation-driven steps, along with an efficient labeling protocol. Extensive experiments reveal substantial gaps between state-of-the-art models and human performance, while demonstrating that summarization-based contextual adaptation can markedly boost results without fine-tuning. The work offers practical guidance on leveraging temporal context and multimodal signals to advance domain generalization in long-form procedural video understanding, with implications for embodied robotics and beyond.

Abstract

Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.

Paper Structure (51 sections, 7 equations, 27 figures, 19 tables)

Figures (27)

  • Figure 1: Key properties of Spacewalk-18: (1) Domain generalization: a side-by-side comparison of a sample frame from Spacewalk-18 and one from Ego4D illustrates our benchmark's novel domain. (2) Multimodal: the visual content of the left frame does not align with its audio/speech. Instead, the speech corresponding to the right frame describes the left frame. (3) Long-form: the left frame shows the astronaut working on the solar array and the right frame shows him releasing bolts. These contextualize each other to identify that he is releasing the solar array in both frames.
  • Figure 2: A spacewalk recording can be 7 or 8 hours long. The step recognition task aims to assign each video clip in the recording a step label, which is illustrated by a short animation and a text description. The question answering task targets video reasoning with long-term multimodal context. Both serve as intermediate benchmarks towards the "Goal", which aims to represent a long procedural video as a sequence of steps and their corresponding video demonstrations for understanding and reasoning.
  • Figure 3: Temporal certificate ("long-form-ness") lengths across commonly adopted datasets with action recognition and question answering annotations. The temporal certificate length of Spacewalk-18 is 1.4x that of the nearest comparable dataset (EgoSchema). Figure adapted from EgoSchema.
  • Figure 4: Ablation on context length. We test the models under various context lengths. Contrastive VLMs are last-layer fine-tuned while MLLMs are zero-shot. When the temporal context is extremely long, the models can no longer benefit from it.
  • Figure 5: Performance of different temporal context incorporation methods built upon frozen InternVideo features. While LFB methods yield increasing mAP as the temporal context grows, one-time feed-forward models with either sparse or dense frame sampling do not benefit from the additional context.
  • ...and 22 more figures