Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen

TL;DR

MicroG-4M is a first-of-its-kind benchmark for video understanding in microgravity. It combines 4,759 three-second clips annotated with 50 fine-grained action labels, 1,238 captions, and 7,428 QA pairs to support human action recognition (HAR), captioning, and visual question answering (VQA) in space contexts. The dataset is built from authentic space mission footage and cinematic simulations, using a collection and annotation pipeline that combines automated bounding-box labeling with human-validated captions and QA pairs. Experiments show a strong domain gap when Earth-trained models are applied to microgravity data, with notable degradation across HAR, captioning, and VQA, and they show that longer temporal windows and microgravity-specific fine-tuning improve performance. The work offers a benchmark and insight into the challenges of space-based perception, providing a platform for developing robust, domain-adapted vision-language systems for astronaut assistance and autonomous mission operations.

Abstract

Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.

Paper Structure

This paper contains 24 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: An illustration of MicroG-4M, which contains videos from real and simulated (e.g., movie) microgravity environments. The dataset supports benchmarks for three tasks: (1) video captioning, (2) video question answering, and (3) fine-grained human action recognition under microgravity.
  • Figure 2: An illustration of the statistics of the dataset and the annotation samples. Word clouds of the (a) Caption, (b) Question, and (c) Answer from our MicroG-4M dataset are provided. The label statistics of the fine-grained human action recognition are provided in (d), which showcases the annotation number per action group (i.e., Object Manipulation (OM), Person Interaction (PI), Person Movement (PM)). The distribution of person counts per video clip is visualized in (e). On the bottom right, one annotation sample from MicroG-4M is provided.
  • Figure 3: Qualitative results for fine-grained human action recognition in microgravity, where GT denotes ground truth, MicroG-4M indicates predictions from Slow fine-tuned on MicroG-4M, and AVA denotes predictions from the same model fine-tuned on AVA. The MicroG-4M model provides more accurate predictions than its Earth-trained counterpart.
  • Figure 4: Qualitative results for fine-grained human action recognition in microgravity. Below the frame samples, the first row presents the ground truth labels of the actions. The second row presents the predictions of the Slow architecture fine-tuned on the AVA dataset. The third row shows the predictions of the Slow architecture fine-tuned on the MicroG-4M dataset. The last row shows the predictions of the I3D Non-Local Network (NLN) architecture fine-tuned on MicroG-4M. For both the Slow and I3D NLN architectures, the AVA and MicroG-4M models were trained under the same configuration: a ResNet-50 backbone with an 8×8 input (frame length × sampling rate), pre-trained on Kinetics-400. The I3D NLN model fine-tuned on MicroG-4M achieved the highest mAP among our baselines. Gray text denotes missed detections, while red text denotes false detections.
  • Figure 5: Qualitative examples illustrating ground-truth captions and outputs from four state-of-the-art multimodal models. The left example represents a challenging scenario, in which all models fail to accurately capture detailed and precise information. The right example demonstrates a relatively simpler scenario, where model-generated captions exhibit closer alignment to the ground truth. Each caption includes the corresponding caption length (in words), with key details highlighted in the ground-truth captions.
  • ...and 3 more figures
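
The Figure 4 caption describes an 8×8 input (frame length × sampling rate), i.e. 8 frames taken 8 frames apart. As a rough illustration of what that means for a three-second clip, the sketch below computes which frame indices such a sampler might pick. The function name `sample_frame_indices`, the centered placement of the window, and the 30 fps frame rate used in the example are all assumptions made for illustration; the source does not state the frame rate or the exact sampling procedure used by the authors.

```python
from typing import List


def sample_frame_indices(num_frames: int, clip_len: int = 8, stride: int = 8) -> List[int]:
    """Pick clip_len frame indices spaced stride frames apart.

    Hypothetical helper: the window is centered in the clip (an assumption),
    and indices are clamped to the last frame when the clip is too short.
    """
    # Number of source frames covered by the sampled window.
    span = (clip_len - 1) * stride + 1
    # Center the window; fall back to frame 0 if the clip is shorter than the span.
    start = max((num_frames - span) // 2, 0)
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


if __name__ == "__main__":
    # Assumed example: a 3-second clip at 30 fps has 90 frames.
    print(sample_frame_indices(90))
```

Under these assumptions an 8×8 window covers 57 source frames, a little under two seconds of a 30 fps clip, which gives a sense of how much of each three-second clip the model sees at once.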