Promptus: Can Prompts Streaming Replace Video Streaming with Stable Diffusion

Jiangkai Wu, Liming Liu, Yunpeng Tan, Junlin Hao, Xinggong Zhang

TL;DR

Promptus replaces pixel-based video streaming with prompt-based semantic streaming by inverting frames into Stable Diffusion prompts and generating frames at the receiver. It combines a gradient-descent-based prompt fitting pipeline for pixel alignment, a low-rank adaptive bitrate mechanism, and an interpolation-aware inter-frame strategy to compress prompts over time. Across multiple domains and real network traces, Promptus achieves over 4x bandwidth reduction relative to H.265 at the same perceptual quality, with notable improvements at ultra-low bitrates and dramatic reductions in severely distorted frames. This work establishes a practical semantic video streaming paradigm and provides open-source tooling to enable reproducible exploration and deployment.

Abstract

With the exponential growth of video traffic, traditional video streaming systems are approaching their limits in compression efficiency and communication capacity. To further reduce bitrate while maintaining quality, we propose Promptus, a disruptive semantic communication system that streams prompts instead of video content: it represents real-world video frames with a series of "prompts" for delivery and employs Stable Diffusion to generate videos at the receiver. To ensure that the generated video is pixel-aligned with the original video, a gradient descent-based prompt fitting framework is proposed. Further, a low-rank decomposition-based bitrate control algorithm is introduced to achieve adaptive bitrate. For inter-frame compression, an interpolation-aware fitting algorithm is proposed. Evaluations across various video genres demonstrate that, compared to H.265, Promptus can achieve more than a 4x bandwidth reduction while preserving the same perceptual quality. At extremely low bitrates, Promptus improves perceptual quality by 0.139 and 0.118 (in LPIPS) compared to VAE and H.265, respectively, and decreases the ratio of severely distorted frames by 89.3% and 91.7%. Our work opens up a new paradigm for efficient video communication. Promptus is open-sourced at: https://github.com/JiangkaiWu/Promptus.
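
To make the low-rank bitrate control idea concrete, the sketch below counts how many values must be sent when a prompt matrix is stored as two low-rank factors, and picks the largest rank that fits a bandwidth budget. It is only an illustration of the general technique: the matrix shape (77 x 768), the 8-bit quantization, the prompt rate, and all function names are assumptions made here, not values or APIs taken from the paper.

```python
def low_rank_param_count(rows, cols, rank):
    """Values needed to store a rows x cols matrix as U (rows x rank) times V (rank x cols)."""
    if rank < 1 or rank > min(rows, cols):
        raise ValueError("rank must be between 1 and min(rows, cols)")
    return rank * (rows + cols)


def bitrate_kbps(rows, cols, rank, bits_per_value, prompts_per_second):
    """Bitrate in kbps when one factorized prompt is sent prompts_per_second times a second."""
    values = low_rank_param_count(rows, cols, rank)
    return values * bits_per_value * prompts_per_second / 1000.0


def max_rank_for_budget(rows, cols, bits_per_value, prompts_per_second, budget_kbps):
    """Largest rank whose bitrate stays within budget_kbps, or 0 if even rank 1 does not fit."""
    best = 0
    for rank in range(1, min(rows, cols) + 1):
        if bitrate_kbps(rows, cols, rank, bits_per_value, prompts_per_second) <= budget_kbps:
            best = rank
        else:
            break  # bitrate grows monotonically with rank
    return best


# Hypothetical setup: a 77 x 768 prompt matrix, 8 bits per value, one prompt per second.
ROWS, COLS = 77, 768
full_values = ROWS * COLS                              # values in the unfactorized matrix
rank8_values = low_rank_param_count(ROWS, COLS, 8)     # values at rank 8
chosen_rank = max_rank_for_budget(ROWS, COLS, 8, 1.0, budget_kbps=100.0)
```

Under these assumed numbers, lowering the rank shrinks the payload roughly linearly, which is what lets a sender trade quality for bitrate as network conditions change.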

Paper Structure

This paper contains 23 sections, 5 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Promptus can invert a given image into a prompt. Based on this prompt, Stable Diffusion can generate an image almost identical to the original. In contrast, existing methods can only generate semantically similar images, while the differences at the pixel level are substantial. In this way, Promptus streams prompts instead of videos, significantly reducing bandwidth overhead.
  • Figure 2: How Stable Diffusion generates high-quality images from text prompts. The details are elaborated in the section on Stable Diffusion background.
  • Figure 3: Workflow of Promptus's video to prompt inversion.
  • Figure 4: Visualization of prompt fitting results. (a) Using random noise as input. (b) Only using MSE as the loss function. (c) Ours. (d) Ground Truth. The results demonstrate that the noisy previous frame and the perceptual loss both contribute to the visual quality.
  • Figure 5: Prompt interpolation not only fully preserves the details but also successfully fits the motion.
  • ...and 13 more figures
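
The prompt interpolation mentioned in Figure 5 can be pictured with a small sketch: only keyframe prompts are transmitted, and the receiver fills in the prompts for the frames in between. The code below assumes plain linear interpolation over prompts stored as flat lists of floats; the actual interpolation scheme is not specified in the text above, and the function name is hypothetical.

```python
def interpolate_prompts(prompt_a, prompt_b, num_between):
    """Return num_between prompts evenly spaced between two keyframe prompts.

    Assumption: linear interpolation, element by element. The endpoints
    themselves are not included in the result.
    """
    if len(prompt_a) != len(prompt_b):
        raise ValueError("keyframe prompts must have the same length")
    if num_between < 0:
        raise ValueError("num_between must be non-negative")
    frames = []
    for k in range(1, num_between + 1):
        t = k / (num_between + 1)  # fraction of the way from prompt_a to prompt_b
        frames.append([(1 - t) * a + t * b for a, b in zip(prompt_a, prompt_b)])
    return frames


# Toy example: two 2-value "prompts" with three intermediate frames.
between = interpolate_prompts([0.0, 2.0], [4.0, 2.0], 3)
```

With this scheme, sending one keyframe prompt every few frames is enough for the receiver to reconstruct a prompt for every frame, which is the source of the inter-frame savings.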