SnapCap: Efficient Snapshot Compressive Video Captioning

Jianqiao Sun; Yudi Su; Hao Zhang; Ziheng Cheng; Zequn Zeng; Zhengjue Wang; Bo Chen; Xin Yuan


TL;DR

This work introduces SnapCap, a reconstruction-free video captioning framework that directly generates captions from coded measurements obtained via video snapshot compressive sensing. By distilling knowledge from a CLIP-based teacher into a measurement-domain student, SnapCap learns language-rich visual representations without reconstructing high‑fidelity frames, yielding substantial speedups over two-stage pipelines while maintaining competitive caption quality on standard VC benchmarks and enabling real-data applicability with CACTI data. The method uses a KD-driven visual encoder, a Transformer-based caption generator, and a two-stage training strategy that combines reconstruction regularization with distillation-guided learning. The results demonstrate strong efficiency gains, robustness to measurement settings, and practical potential for real-world streaming VC on compressed measurements.
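The combination of distillation and reconstruction regularization mentioned above can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the mean-squared-error form of both terms, the function names `mse` and `student_loss`, the weight `reg_weight`, and the feature shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def student_loss(student_feat, teacher_feat, recon, video, reg_weight=0.5):
    """Hypothetical training objective for the measurement-domain student.

    distill: pulls student features toward the (frozen) teacher features.
    reg:     penalizes a coarse reconstruction that drifts from the video.
    The MSE form and reg_weight are illustrative assumptions.
    """
    distill = mse(student_feat, teacher_feat)
    reg = mse(recon, video)
    return distill + reg_weight * reg

# Toy data standing in for teacher/student features and video frames.
rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(16, 512))
student_feat = teacher_feat + 0.1 * rng.normal(size=(16, 512))
video = rng.random((8, 32, 32))
recon = video.copy()  # a perfect reconstruction, so the regularizer is zero

loss = student_loss(student_feat, teacher_feat, recon, video)
```

With a perfect reconstruction the loss reduces to the distillation term alone, which makes the role of each term easy to inspect separately.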

Abstract

Video Captioning (VC) is a challenging multi-modal task because it requires understanding diverse and complex videos and describing the scene in language. For machines, traditional VC follows an "imaging-compression-decoding-and-then-captioning" pipeline, in which compression is pivotal for storage and transmission. Such a pipeline has inherent shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process, which hurts captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera; we dub our model SnapCap. Specifically, because the measurement process can be simulated, we can obtain abundant measurement-video-annotation data pairs for training our model. In addition, to better extract language-related visual representations from the compressed measurement, we distill knowledge from videos via a pre-trained CLIP, which carries rich language-vision associations, to guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both the qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, SnapCap runs at least 3$\times$ faster and achieves better caption results.
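The signal simulation mentioned in the abstract can be sketched with the standard video snapshot compressive sensing forward model, in which each frame is modulated by a mask and the modulated frames are summed into a single 2D measurement. The sketch below is a minimal illustration under stated assumptions: the function name `simulate_snapshot_measurement`, the random binary masks, the optional Gaussian noise, and the toy sizes are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def simulate_snapshot_measurement(frames, masks, noise_std=0.0, rng=None):
    """Compress B frames into one snapshot: Y = sum_t (mask_t * frame_t) + noise.

    frames, masks: arrays of shape (B, H, W). Returns an array of shape (H, W).
    The additive Gaussian noise model is an illustrative assumption.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must have the same (B, H, W) shape")
    # Element-wise modulation, then integration over the temporal axis.
    measurement = (frames * masks).sum(axis=0)
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        measurement = measurement + rng.normal(0.0, noise_std, size=measurement.shape)
    return measurement

# Toy example: 8 frames of size 4x4 with random binary masks.
rng = np.random.default_rng(0)
B, H, W = 8, 4, 4
frames = rng.random((B, H, W))
masks = rng.integers(0, 2, size=(B, H, W))
y = simulate_snapshot_measurement(frames, masks)
print(y.shape)  # (4, 4)
```

Applying such a function to the clips of an existing captioning dataset would pair each simulated measurement with its source video and annotations, which is the kind of measurement-video-annotation triple the abstract refers to.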


Paper Structure (22 sections, 13 equations, 10 figures, 6 tables, 2 algorithms)


Introduction
Preliminary and Related Works
Video Snapshot Compressive Sensing
Video Captioning
Knowledge Distillation
Methodology
Visual Encoder via Knowledge Distillation
Caption Generator
Learning and Inference
Experiments
Experimental Settings
Comparison with VC Methods
Ablation Study
Comparison with two-stage methods
Effects of regularization and distillation
...and 7 more sections

Figures (10)

Figure 1: Comparing our novel video captioning pipeline in (c) with the traditional pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
Figure 2: Comparison of GPU memory, inference time, and CIDEr score for typical VC methods, where red, blue, and yellow indicate our method, traditional VC methods, and two-stage methods, respectively. The size of each circle is proportional to the CIDEr score ($\uparrow$) marked in brackets.
Figure 3: Illustration of a video snapshot CS system, CACTI.
Figure 4: Learning and inference workflows of our proposed SnapCap. Parts (a), (b), and (c) are used together for training; only (b) is needed for end-to-end captioning during testing.
Figure 5: Qualitative results on MSRVTT (top row) and MSVD (bottom row). We show the compressed measurement, the caption predicted by our SnapCap, and the ground truth. For better understanding, we also show the ground-truth video frames.
...and 5 more figures
