A Low-Computational Video Synopsis Framework with a Standard Dataset

Ramtin Malekpour; M. Mehrdad Morsali; Hoda Mohammadzade

TL;DR

This work addresses the lack of standardized evaluation for video synopsis by introducing SynoClip, a dataset of long, uncrowded videos from outdoor-mounted cameras with tube annotations, and presents Fast Greedy Synopsis (FGS), a low-computation video synopsis framework. The system has three stages: tube extraction with a lightweight empty-frame detector and a fast YOLOv8n-based detector; tube grouping followed by greedy rearrangement, which preserves relationships and chronological order among tubes; and a segmentation-driven visualization pipeline for higher visual quality. Key contributions include a novel grouping-based tube rearrangement algorithm, a segmentation method that operates within bounding boxes, and a normalized compression measure $NFR = \frac{FR \times Coverage}{100\times Density}$ that enables fair cross-video comparisons. Empirical results show that FGS runs at an average of $1294$ fps and improves compression, with the largest gains over prior methods at higher collision levels (e.g., $11.85\times$ speedup and $1.85\times$ compression gain at $10\%$ collision). The codebase and tutorials are openly available to facilitate adoption.
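The normalized measure above can be evaluated directly once its three inputs are known. The following is a minimal Python sketch, assuming that FR is a frame-reduction ratio, that Coverage is expressed as a percentage (which would explain the factor of 100), and that Density is a positive per-video value. These readings and the function name `normalized_frame_reduction` are assumptions made for illustration; they are not definitions taken from the summary.

```python
def normalized_frame_reduction(fr: float, coverage: float, density: float) -> float:
    """Compute NFR = (FR * Coverage) / (100 * Density).

    Assumed meanings (not stated in the summary):
      fr       -- frame-reduction ratio of the synopsis
      coverage -- share of objects kept, as a percentage (0..100)
      density  -- object density of the source video, must be > 0
    """
    if density <= 0:
        raise ValueError("density must be positive")
    return (fr * coverage) / (100.0 * density)


# Hypothetical input values, chosen only to show the arithmetic.
example_nfr = normalized_frame_reduction(fr=10.0, coverage=90.0, density=1.5)
print(example_nfr)  # 10 * 90 / (100 * 1.5) = 6.0
```

Dividing by Density is what makes the measure comparable across videos: under this reading, a sparse video that compresses easily does not score higher than a denser one simply because it had less content to arrange.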

Abstract

Video synopsis is an efficient method for condensing surveillance videos. The technique begins with detecting and tracking objects, followed by the creation of object tubes. Each tube is a chronologically ordered sequence of bounding boxes belonging to a single object. To generate a condensed video, the object tubes are first rearranged to maximize the number of non-overlapping objects in each frame. The tubes are then stitched onto a background image extracted from the source video. The lack of a standard dataset for the video synopsis task hinders the comparison of different video synopsis models. This paper addresses this issue by introducing a standard dataset, called SynoClip, designed specifically for the video synopsis task. SynoClip includes the features needed to evaluate various models directly and effectively. Additionally, this work introduces a video synopsis model, called FGS, with low computational cost. The model includes an empty-frame object detector that identifies frames containing no objects, which allows the deep object detector to be used efficiently. Moreover, a tube grouping algorithm is proposed to maintain relationships among tubes in the synthesized video. This is followed by a greedy tube rearrangement algorithm, which efficiently determines the start time of each tube. Finally, the proposed model is evaluated on the proposed dataset. The source code, fine-tuned object detection model, and tutorials are available at https://github.com/Ramtin-ma/VideoSynopsis-FGS.
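The idea of greedily choosing a start time for each tube can be illustrated with a short Python sketch. This is not the FGS algorithm itself: it omits tube grouping, allows no collisions at all (whereas the reported results tolerate a collision level), and uses hypothetical names (`Box`, `Tube`, `greedy_rearrange`). It assumes each tube holds one bounding box per consecutive frame and that tubes arrive in source-video order.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed layout
Tube = List[Box]                 # one box per consecutive frame


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def collides(tube: Tube, start: int, placed: List[Tuple[int, Tube]]) -> bool:
    """Check whether `tube`, shifted to begin at `start`, hits any placed tube."""
    for other_start, other in placed:
        # Synopsis frames in which both tubes are visible.
        first = max(start, other_start)
        last = min(start + len(tube), other_start + len(other))
        for t in range(first, last):
            if boxes_overlap(tube[t - start], other[t - other_start]):
                return True
    return False


def greedy_rearrange(tubes: List[Tube]) -> List[int]:
    """Give each tube the earliest collision-free start, in source order."""
    placed: List[Tuple[int, Tube]] = []
    starts: List[int] = []
    start = 0
    for tube in tubes:
        # `start` never moves backwards, so the source ordering is preserved.
        while collides(tube, start, placed):
            start += 1
        placed.append((start, tube))
        starts.append(start)
    return starts


# Two tubes that cross the same region, and one far away from both.
a = [(0, 0, 10, 10)] * 3
b = [(5, 5, 15, 15)] * 2
c = [(100, 100, 110, 110)] * 2
print(greedy_rearrange([a, b, c]))  # [0, 3, 3]
```

In the example, the second tube must wait until the first has left the shared region, while the third can play alongside the second because they never overlap spatially. Each tube is placed once and never revisited, which is what keeps a greedy scheme of this kind cheap compared with global optimization.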

Paper Structure (20 sections, 8 equations, 8 figures, 5 tables, 2 algorithms)

Introduction
Proposed Method
Tube Extraction
Deep Object Detector
Empty-frame Object Detector
Multi-object Tracker
Tube Rearrangement
Tube Grouping
Group Rearrangement
Visualization
Background Generation
Segmentation
Stitching
Metrics
Dataset
...and 5 more sections

Figures (8)

Figure 1: Condensed video production using the video synopsis approach.
Figure 2: The proposed video synopsis system.
Figure 3: Processing steps of the proposed empty-frame object detector.
Figure 4: Examples illustrating the necessity of tube grouping based on spatial distance and occlusion: (a) Two interacting tubes (green and blue boxes) with low average spatial distance in shared frames. (b) Two tubes (red and yellow boxes) with distinct trajectories but frequent overlaps.
Figure 5: Different steps in segmentation mask generation: (a) Mask 1: Absolute difference between the extracted background and the object's bounding box image, (b) Mask 2: Motion, (c) Sum of the two masks, (d) Binary mask obtained using determined threshold, and (e) Final mask.
...and 3 more figures
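The first four mask-generation steps named in the Figure 5 caption can be sketched in plain Python. The sketch assumes grayscale crops stored as nested lists, approximates the motion mask by differencing against the previous frame's crop, and takes the threshold as a caller-supplied value; all three choices, as well as the names `abs_diff` and `binary_mask`, are assumptions. The step that produces the final mask is not described in the caption and is left out.

```python
from typing import List

Image = List[List[int]]  # grayscale crop, pixel values 0..255


def abs_diff(a: Image, b: Image) -> Image:
    """Per-pixel absolute difference of two equally sized images."""
    return [[abs(p - q) for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def binary_mask(crop: Image, background: Image,
                previous_crop: Image, threshold: int) -> Image:
    """Combine a background-difference mask and a motion mask, then threshold."""
    mask1 = abs_diff(crop, background)      # (a) difference from the background
    mask2 = abs_diff(crop, previous_crop)   # (b) motion cue, assumed frame differencing
    return [[1 if m1 + m2 > threshold else 0   # (c) sum, (d) binarize
             for m1, m2 in zip(row1, row2)]
            for row1, row2 in zip(mask1, mask2)]


# Tiny 2x2 example: only the top-left pixel differs from the background.
crop = [[200, 10], [10, 10]]
background = [[10, 10], [10, 10]]
previous_crop = [[10, 10], [10, 10]]
print(binary_mask(crop, background, previous_crop, threshold=100))  # [[1, 0], [0, 0]]
```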
