Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Zitian Tang, Rohan Myer Krishnan, Zhiqiu Yu, Chen Sun
TL;DR
Spacewalk-18 addresses procedural video understanding in unseen domains by introducing a multimodal, long-form benchmark built from spacewalk footage. It defines two tasks, step recognition and video question answering, and provides 96 hours of densely annotated spacewalks with 455 animation-driven steps, along with an efficient labeling protocol. Extensive experiments reveal substantial gaps between state-of-the-art models and human performance, while showing that summarization-based contextual adaptation can markedly improve results without fine-tuning. The work offers practical guidance on leveraging temporal context and multimodal signals to advance domain generalization in long-form procedural video understanding, with implications for embodied robotics and beyond.
Abstract
Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize that understanding to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks, (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
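To make "temporal segmentation of a demonstration into sequences of actions" concrete, the following is a minimal sketch of how per-frame step labels could be collapsed into contiguous segments. It illustrates the general idea only and is not the benchmark's annotation format or tooling; the `Segment` type, the `frames_to_segments` helper, and the integer step ids are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Segment:
    """A contiguous run of frames assigned to one step (hypothetical type)."""
    step: int   # step id, assumed here to be a plain integer
    start: int  # index of the first frame, inclusive
    end: int    # index one past the last frame, exclusive


def frames_to_segments(labels):
    """Collapse a per-frame label sequence into contiguous segments.

    A step that is interrupted and later resumed yields two separate
    segments, since only adjacent equal labels are merged.
    """
    segments = []
    position = 0
    for step, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append(Segment(step, position, position + length))
        position += length
    return segments


if __name__ == "__main__":
    # Toy input: step 0 is interrupted by step 1 and then resumed.
    for segment in frames_to_segments([0, 0, 1, 1, 1, 0, 2]):
        print(segment)
```

Using half-open `[start, end)` intervals keeps segment lengths equal to `end - start` and lets adjacent segments share a boundary index without overlapping.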
