Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames

Chao Chen, Mingzhi Zhu, Ankush Pratap Singh, Yu Yan, Felix Juefei-Xu, Chen Feng

TL;DR

This work defines scene summarization as condensing long scene trajectories into a small, spatially diverse set of keyframes to support global spatial reasoning. It introduces SceneSum, a two-stage, self-supervised framework (with optional ground-truth supervision) that first clusters frames using spatially aware features, preferably from visual place recognition (VPR), and then selects one representative keyframe per cluster via a sampling-based masked autoencoder with a contrastive objective. The approach yields more spatially informative summaries than traditional video summarizers, with strong results on Habitat-Sim and KITTI, and is shown to apply to downstream tasks such as Q&A and Sim2Real transfer. The key contributions are the emphasis on VPR-based clustering, a memory-efficient keyframe selection mechanism, and a Divergence-based evaluation metric for spatial diversity.

Abstract

Humans are remarkably efficient at forming spatial understanding from just a few visual observations. When browsing real estate or navigating unfamiliar spaces, they intuitively select a small set of views that summarize the spatial layout. Inspired by this ability, we introduce scene summarization, the task of condensing long, continuous scene videos into a compact set of spatially diverse keyframes that facilitate global spatial reasoning. Unlike conventional video summarization, which focuses on user-edited, fragmented clips and often ignores spatial continuity, scene summarization aims to mimic how humans abstract spatial layout from sparse views. We propose SceneSum, a two-stage self-supervised pipeline that first clusters video frames using visual place recognition to promote spatial diversity, then selects representative keyframes from each cluster under resource constraints. When camera trajectories are available, a lightweight supervised loss further refines clustering and selection. Experiments on real and simulated indoor datasets show that SceneSum produces more spatially informative summaries and outperforms existing video summarization baselines.
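The two-stage idea (cluster frames by feature similarity, then keep one frame per cluster) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy 2D vectors stand in for per-frame VPR descriptors, plain k-means with farthest-point initialization stands in for the clustering stage, and picking the frame nearest each centroid stands in for the paper's sampling-based masked autoencoder selector. The names `kmeans`, `farthest_point_init`, and `select_keyframes` are hypothetical and do not come from SceneSum.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(features, k):
    """Deterministic init: start at frame 0, then repeatedly add the
    frame farthest from all centroids chosen so far."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(features, k, iterations=20):
    """Plain Lloyd iterations; returns (assignments, centroids)."""
    centroids = farthest_point_init(features, k)
    assignments = []
    for _ in range(iterations):
        # Assign every frame to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [f for f, a in zip(features, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


def select_keyframes(features, k):
    """Stage 1: cluster frame features. Stage 2 (simplified stand-in):
    keep the frame closest to each cluster centroid."""
    assignments, centroids = kmeans(features, k)
    keyframes = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            keyframes.append(
                min(members, key=lambda i: squared_distance(features[i], centroids[c]))
            )
    return sorted(keyframes)


# Toy features: nine frames forming three well-separated groups.
toy_features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
    [20.0, 0.0], [20.1, 0.0], [20.0, 0.1],
]
print(select_keyframes(toy_features, 3))  # one frame index per group
```

In this toy setting the summary contains exactly one frame from each group, which is the behavior the clustering stage is meant to encourage. A learned selector could replace the nearest-to-centroid rule without changing the overall structure.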

Paper Structure

This paper contains 22 sections, 7 equations, 18 figures, and 6 tables.

Figures (18)

  • Figure 1: Scene summarization vs video summarization: orange box shows frames selected by scene summarization, blue box shows those by video summarization. The right side displays the camera trajectory and floorplan, with blue/orange nodes indicating video/scene summarization frames. We observe that scene summarization promotes spatial diversity in the summarized frames, unlike video summarization that may select frames in close spatial proximity (frame A and B).
  • Figure 2: Overview of SceneSum. Our approach consists of two main stages: (1) Clustering: we use a contrastive learning (CL) or visual place recognition (VPR) encoder to encode images into features, then cluster frames based on these features or ground truth odometry if available. (2) Keyframe Selection: we select the most representative and visually distinct frame from each cluster. If spatial information is available, an optional Ground Truth supervision stage can be enabled, switching the model from self-supervised to supervised by using pre-computed keyframes from ground truth trajectories to guide keyframe selection. Frames circled in green denote the selected keyframes.
  • Figure 3: t-SNE plots color-coded by the sequence order for contrastive-based (CL) and visual place recognition (VPR)-based clustering approaches. The first three columns represent the zero-shot feature space from their respective pretrained models; the last column shows the feature space after fine-tuning on the Gibson dataset. Even in the zero-shot setting, VPR approaches exhibit a more continuous feature space, reflecting the continuous trajectory nature, compared to CL approaches. This distinction becomes even more pronounced after fine-tuning the VPR models.
  • Figure 4: Selected keyframes in Habitat-Sim Dataset. Each of $7$ baselines summarizes the scene Stokes into $20$ keyframes. All frames are color-coded by temporal order. Summarized keyframes are marked with red crosses. Groups of frames that are geographically close to each other are circled in yellow.
  • Figure 5: Selected keyframes in KITTI Dataset. Each of $7$ baselines summarizes scene 0028 into $20$ keyframes. The baselines and annotations follow Fig. 4.
  • ...and 13 more figures
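
The captions for Figures 1 and 4 point out summaries whose selected frames lie close together in space. A minimal sketch of such a proximity check follows, assuming the 2D camera position of each selected keyframe is known. The function name `close_pairs`, the `radius` threshold, and the sample positions are hypothetical, and this is not the paper's Divergence-based metric.

```python
import math
from itertools import combinations


def close_pairs(positions, radius):
    """Return index pairs of keyframes whose camera positions lie
    within `radius` of each other (candidates for redundant views)."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(positions), 2)
        if math.dist(p, q) <= radius
    ]


# Hypothetical keyframe positions in metres: the first two are near-duplicates.
keyframe_positions = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
print(close_pairs(keyframe_positions, radius=1.0))
```

An empty result means no two selected keyframes fall within the chosen radius, which is one simple way to see whether a summary avoids the clustering of frames A and B described in Figure 1.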