PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
Jaehyun Choi, Jiwan Hur, Gyojin Han, Jaemyung Yu, Junmo Kim
TL;DR
PRISM is a holistic video dataset condensation approach that preserves the interdependence of content and motion by progressively refining a sparse set of key frames rather than disentangling the two. It initializes each synthetic video with two endpoint frames and fills it to length $T$ by temporal interpolation; a gradient-guided criterion inserts frames when the gradient cosine similarity with neighboring key frames falls below a threshold $\epsilon$. The optimization objective aligns the synthetic data with real data through gradient matching, $\min_{\theta,\{S_c^j\}} \sum_{c=1}^{C} \left\| \nabla_\theta \mathcal{L}_{\text{task}}(f_\theta(\mathcal{B}_c^{\text{syn}}), y_c) - \nabla_\theta \mathcal{L}_{\text{task}}(f_\theta(\mathcal{B}_c^{\text{real}}), y_c) \right\|_2^2$, and only the key frames are updated. Extensive experiments on UCF101, HMDB51, Something-Something V2, and Kinetics-400 show that PRISM achieves state-of-the-art performance with up to about 70–75% storage reduction and strong cross-architecture generalization, enabling scalable, storage-efficient condensation for resource-constrained deployment.
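The three ingredients named above (interpolation from sparse key frames, the cosine-similarity insertion test, and the gradient-matching distance) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, linear interpolation is assumed as the temporal interpolation, per-frame gradients are assumed to be supplied by an external autodiff step, and a frame is assumed to qualify for insertion only when its similarity to both neighboring key frames is below $\epsilon$.

```python
import numpy as np

def interpolate_video(key_idx, key_frames, T):
    """Fill a length-T video from key frames by linear interpolation (assumed scheme).

    key_idx must be sorted and contain the endpoints 0 and T - 1.
    """
    key_idx = np.asarray(key_idx)
    video = np.empty((T,) + key_frames.shape[1:], dtype=float)
    for t in range(T):
        r = int(np.searchsorted(key_idx, t))  # first key frame at or after t
        if key_idx[r] == t:                   # t is itself a key frame
            video[t] = key_frames[r]
            continue
        l = r - 1                             # nearest key frame before t
        w = (t - key_idx[l]) / (key_idx[r] - key_idx[l])
        video[t] = (1.0 - w) * key_frames[l] + w * key_frames[r]
    return video

def cosine(a, b, tiny=1e-12):
    """Cosine similarity between two flattened gradient tensors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + tiny))

def frames_to_insert(key_idx, frame_grads, epsilon):
    """Non-key positions whose gradient disagrees with both neighboring key frames.

    Assumption: "both neighbors below epsilon" is one reading of the criterion.
    frame_grads[t] is the loss gradient w.r.t. frame t, computed elsewhere.
    """
    key_idx = list(key_idx)
    selected = []
    for l, r in zip(key_idx[:-1], key_idx[1:]):
        for t in range(l + 1, r):
            if (cosine(frame_grads[t], frame_grads[l]) < epsilon
                    and cosine(frame_grads[t], frame_grads[r]) < epsilon):
                selected.append(t)
    return selected

def gradient_matching_loss(grads_syn, grads_real):
    """Squared L2 distance between synthetic and real network gradients."""
    return float(sum(np.sum((gs - gr) ** 2) for gs, gr in zip(grads_syn, grads_real)))

# Usage: two endpoint key frames (one pixel each), filled to T = 5 frames.
video = interpolate_video([0, 4], np.array([[0.0], [4.0]]), T=5)
grads = np.array([[1, 0], [1, 0], [-1, 0], [1, 0], [1, 0]], dtype=float)
new_keys = frames_to_insert([0, 4], grads, epsilon=0.0)
```

In this toy example only the middle frame has a gradient pointing away from both endpoints, so it is the only candidate for insertion. In an actual training loop, only the key frames would receive updates, with the interpolated frames recomputed from them after each step.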
Abstract
Video dataset condensation has emerged as a critical technique for addressing the computational challenges associated with large-scale video data processing in deep learning applications. While significant progress has been made in image dataset condensation, the video domain presents unique challenges due to the complex interplay between spatial content and temporal dynamics. This paper introduces PRISM (Progressive Refinement and Insertion for Sparse Motion), a novel approach to video dataset condensation that fundamentally reconsiders how video data should be condensed. Unlike the previous method, which separates static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames, considering the relation between the gradients of individual frames, to fully accommodate the motion in an action while achieving better performance with less storage. Extensive experiments across standard video action recognition benchmarks demonstrate that PRISM outperforms existing disentangled approaches while maintaining compact representations suitable for resource-constrained environments.
