
SnapCap: Efficient Snapshot Compressive Video Captioning

Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Bo Chen, Xin Yuan

TL;DR

This work introduces SnapCap, a reconstruction-free video captioning (VC) framework that generates captions directly from coded measurements obtained via video snapshot compressive sensing. By distilling knowledge from a CLIP-based teacher into a measurement-domain student, SnapCap learns language-rich visual representations without reconstructing high-fidelity frames. This yields substantial speedups over two-stage pipelines while maintaining competitive caption quality on standard VC benchmarks, and it makes the method applicable to real data captured with CACTI. The method uses a visual encoder trained with knowledge distillation (KD), a Transformer-based caption generator, and a two-stage training strategy that combines reconstruction regularization with distillation-guided learning. The results demonstrate strong efficiency gains, robustness to measurement settings, and practical potential for real-world streaming VC on compressed measurements.
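The distillation idea above can be illustrated with a minimal sketch. It assumes the student and teacher both emit feature arrays of the same shape and that the two are aligned with a mean-squared-error loss; the loss form, the function name `distillation_loss`, and the random placeholder features are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher features.

    Assumption: an MSE feature-matching loss; the source does not
    state which distillation loss SnapCap actually uses.
    """
    s = np.asarray(student_feats, dtype=np.float64)
    t = np.asarray(teacher_feats, dtype=np.float64)
    if s.shape != t.shape:
        raise ValueError("student and teacher features must share a shape")
    return float(np.mean((s - t) ** 2))


# Placeholder features (hypothetical): 8 tokens, 16 dimensions each.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 16))                   # stands in for frozen teacher output
student = teacher + 0.1 * rng.standard_normal((8, 16))   # stands in for student output
loss = distillation_loss(student, teacher)
```

In a real training loop the teacher would stay frozen and only the student encoder would receive gradients from this term.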

Abstract

Video Captioning (VC) is a challenging multi-modal task because it requires describing a scene in language by understanding diverse and complex videos. For machines, traditional VC follows the "imaging-compression-decoding-and-then-captioning" pipeline, where compression is pivotal for storage and transmission. However, such a pipeline has inherent shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process, which hurts captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera; we dub our model SnapCap. More specifically, signal simulation gives us access to abundant measurement-video-annotation data pairs for our model. In addition, to better extract language-related visual representations from the compressed measurement, we distill knowledge from videos via a pre-trained CLIP, which has plentiful language-vision associations, to guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both the qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, SnapCap runs at least 3$\times$ faster and achieves better caption results.
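The signal simulation mentioned in the abstract can be sketched with the standard video snapshot compressive sensing forward model, in which each frame is modulated by its own mask and the modulated frames are summed into a single 2D measurement. The sketch below assumes random binary masks and a noise-free sensor; the function name `simulate_snapshot` and the array sizes are illustrative, not taken from the paper.

```python
import numpy as np


def simulate_snapshot(video, masks):
    """Compress T frames into one 2D measurement: Y = sum_t C_t * X_t.

    video, masks: arrays of shape (T, H, W). The product is element-wise.
    Assumption: no sensor noise is added.
    """
    video = np.asarray(video, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if video.shape != masks.shape:
        raise ValueError("video and masks must share shape (T, H, W)")
    return (video * masks).sum(axis=0)


rng = np.random.default_rng(0)
T, H, W = 8, 4, 4                              # hypothetical sizes
video = rng.random((T, H, W))                  # stand-in for T grayscale frames
masks = rng.integers(0, 2, size=(T, H, W))     # assumed random binary masks
measurement = simulate_snapshot(video, masks)
print(measurement.shape)  # (4, 4)
```

Pairing each simulated measurement with the source clip and its existing captions gives the measurement-video-annotation triples the abstract refers to.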

Paper Structure

This paper contains 22 sections, 13 equations, 10 figures, 6 tables, 2 algorithms.

Figures (10)

  • Figure 1: Comparing our novel video captioning pipeline in (c) with the traditional pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
  • Figure 2: Comparison of GPU memory, inference time, and CIDEr score for typical VC methods, where red, blue, and yellow indicate our methods, traditional VC methods, and two-stage methods, respectively. The size of each circle is proportional to the CIDEr score ($\uparrow$) marked in brackets.
  • Figure 3: Illustration of a video snapshot CS system, CACTI.
  • Figure 4: Learning and inference workflows of our proposed SnapCap. The cooperation of (a), (b), and (c) is for training, and only (b) is needed for an end-to-end captioning during testing.
  • Figure 5: Qualitative results on MSRVTT (top row) and MSVD (bottom row). We show the compressed measurement, the caption predicted by our SnapCap, and the ground truth. For better understanding, we also show the ground-truth video frames.
  • ...and 5 more figures