Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames
Chao Chen, Mingzhi Zhu, Ankush Pratap Singh, Yu Yan, Felix Juefei-Xu, Chen Feng
TL;DR
This work defines scene summarization as condensing long scene trajectories into a small, spatially diverse set of keyframes to support global spatial reasoning. It introduces SceneSum, a two-stage, self-supervised framework (with optional ground-truth supervision) that first clusters frames using spatially aware features (preferably VPR-based) and then selects one representative keyframe per cluster via a sampling-based masked autoencoder with a contrastive objective. The approach yields more spatially informative summaries than traditional video summarizers, with strong results on Habitat-Sim and KITTI and demonstrated applicability to downstream tasks such as Q&A and Sim2Real transfer. The key contributions include the emphasis on VPR-based clustering, a memory-efficient keyframe selection mechanism, and a divergence-based evaluation metric for spatial diversity.
Abstract
Humans are remarkably efficient at forming spatial understanding from just a few visual observations. When browsing real estate or navigating unfamiliar spaces, they intuitively select a small set of views that summarize the spatial layout. Inspired by this ability, we introduce scene summarization, the task of condensing long, continuous scene videos into a compact set of spatially diverse keyframes that facilitate global spatial reasoning. Unlike conventional video summarization, which focuses on user-edited, fragmented clips and often ignores spatial continuity, our goal is to mimic how humans abstract spatial layout from sparse views. We propose SceneSum, a two-stage self-supervised pipeline that first clusters video frames using visual place recognition to promote spatial diversity, then selects representative keyframes from each cluster under resource constraints. When camera trajectories are available, a lightweight supervised loss further refines clustering and selection. Experiments on real and simulated indoor datasets show that SceneSum produces more spatially informative summaries and outperforms existing video summarization baselines.
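The two-stage structure described above (cluster the frames, then keep one representative per cluster) can be illustrated with a minimal sketch. This is not the SceneSum implementation: plain k-means with a farthest-point initialisation stands in for the VPR-based clustering, and a nearest-to-centroid rule stands in for the learned keyframe selector. The function name `summarize_scene`, its parameters, and the assumption that each frame already has a precomputed feature vector are all hypothetical.

```python
import numpy as np


def summarize_scene(descriptors, num_keyframes, num_iters=20):
    """Return sorted indices of one representative frame per cluster.

    Illustrative stand-in only: k-means replaces the paper's VPR-based
    clustering, and nearest-to-centroid replaces its learned selector.
    `descriptors` is assumed to be an (N, D) array of per-frame features.
    """
    x = np.asarray(descriptors, dtype=float)

    # Farthest-point initialisation: start from frame 0, then repeatedly
    # add the frame that is farthest from every frame chosen so far.
    chosen = [0]
    min_d = ((x - x[0]) ** 2).sum(axis=1)
    for _ in range(1, num_keyframes):
        nxt = int(min_d.argmax())
        chosen.append(nxt)
        min_d = np.minimum(min_d, ((x - x[nxt]) ** 2).sum(axis=1))
    centroids = x[chosen].copy()

    # Stage 1: Lloyd iterations (assign frames, then update centroids).
    for _ in range(num_iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(num_keyframes):
            members = x[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)

    # Stage 2: in each non-empty cluster, keep the frame closest to the centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    keyframes = []
    for k in range(num_keyframes):
        idx = np.flatnonzero(labels == k)
        if len(idx) > 0:
            keyframes.append(int(idx[dists[idx, k].argmin()]))
    return sorted(keyframes)
```

Calling `summarize_scene(features, num_keyframes=5)` on an `(N, D)` feature array returns at most five frame indices. How spatially diverse the result is depends on the features: the summary stresses that place-recognition descriptors are preferred, whereas this sketch accepts any feature vectors.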
