A Low-Computational Video Synopsis Framework with a Standard Dataset
Ramtin Malekpour, M. Mehrdad Morsali, Hoda Mohammadzade
TL;DR
This work addresses the lack of standardized evaluation for video synopsis by introducing SynoClip, a dataset of long, uncrowded videos from outdoor-mounted cameras with tube annotations, and presents Fast Greedy Synopsis (FGS), a low-computation video synopsis framework. The system comprises three stages: tube extraction with a lightweight empty-frame detector and a fast YOLOv8n-based detector; tube grouping followed by greedy rearrangement, which preserves relationships among tubes and their chronological order; and a segmentation-driven visualization pipeline for higher visual quality. Key contributions include a novel grouping-based tube rearrangement algorithm, a segmentation method that operates within bounding boxes, and a normalized compression measure, $NFR = \frac{FR \times Coverage}{100\times Density}$, that enables fair cross-video comparisons. Empirical results show that FGS runs at $1294$ fps on average and improves compression, outperforming prior methods most clearly at higher collision levels (e.g., $11.85\times$ speedup and $1.85\times$ compression gain at $10\%$ collision). The codebase and tutorials are openly available to facilitate adoption.
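As a quick illustration of the NFR formula above, the following minimal sketch evaluates it for one set of inputs. The function name `nfr`, the interpretation of Coverage as a percentage (suggested by the factor of 100 but not stated here), and the sample values are assumptions made for illustration only; they are not results from the paper.

```python
def nfr(fr: float, coverage: float, density: float) -> float:
    """Evaluate NFR = (FR * Coverage) / (100 * Density).

    Assumption: `coverage` is given as a percentage in [0, 100];
    the summary above does not define the units of any input.
    """
    if density <= 0:
        raise ValueError("density must be positive")
    return (fr * coverage) / (100.0 * density)


# Hypothetical inputs chosen only to show the arithmetic.
example_nfr = nfr(fr=20.0, coverage=90.0, density=1.5)
print(example_nfr)  # 20 * 90 / (100 * 1.5) = 12.0
```

Under this reading, lower coverage or a denser source video both reduce the score for the same FR.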
Abstract
Video synopsis is an efficient method for condensing surveillance videos. The technique begins with the detection and tracking of objects, followed by the creation of object tubes. Each tube is a chronologically ordered sequence of bounding boxes belonging to a single object. To generate a condensed video, the object tubes are first rearranged to maximize the number of non-overlapping objects in each frame. The tubes are then stitched onto a background image extracted from the source video. The lack of a standard dataset for the video synopsis task hinders the comparison of different video synopsis models. This paper addresses this issue by introducing a standard dataset, called SynoClip, designed specifically for the video synopsis task. SynoClip includes all the features needed to evaluate various models directly and effectively. Additionally, this work introduces a video synopsis model, called FGS, with low computational cost. The model includes an empty-frame detector that identifies frames containing no objects, so that the deep object detector runs only where it is needed. Moreover, a tube grouping algorithm is proposed to maintain relationships among tubes in the synthesized video. This is followed by a greedy tube rearrangement algorithm, which efficiently determines the start time of each tube. Finally, the proposed model is evaluated using the proposed dataset. The source code, fine-tuned object detection model, and tutorials are available at https://github.com/Ramtin-ma/VideoSynopsis-FGS.
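The idea of greedily assigning a start time to each tube can be sketched as follows. This is a simplified illustration, not the FGS algorithm: it assumes zero tolerated collision, omits tube grouping, and uses hypothetical names (`boxes_overlap`, `greedy_start_times`). Each tube is taken to be a list of `(x1, y1, x2, y2)` boxes, one per frame, and tubes are passed in their original chronological order.

```python
from collections import defaultdict


def boxes_overlap(a, b):
    """Return True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def greedy_start_times(tubes):
    """Assign each tube the earliest collision-free start frame.

    Assumption: a tube may not start before the tube that precedes it,
    which is one simple way to keep chronological order.
    """
    occupied = defaultdict(list)  # synopsis frame index -> boxes placed there
    starts = []
    earliest = 0
    for tube in tubes:
        start = earliest
        # Shift the tube later until none of its boxes hits a placed box.
        while any(
            boxes_overlap(box, other)
            for offset, box in enumerate(tube)
            for other in occupied.get(start + offset, ())
        ):
            start += 1
        for offset, box in enumerate(tube):
            occupied[start + offset].append(box)
        starts.append(start)
        earliest = start
    return starts


# Hypothetical tubes: two share the same region, one is elsewhere.
tubes = [
    [(0, 0, 10, 10)] * 3,
    [(0, 0, 10, 10)] * 2,
    [(20, 20, 30, 30)] * 2,
]
starts = greedy_start_times(tubes)
print(starts)  # [0, 3, 3]
```

The second tube is pushed past the first because they occupy the same region, while the third can share frames with the second because their boxes do not overlap. A version that tolerates a collision budget would replace the strict overlap test with an accumulated overlap measure compared against a threshold.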
