Mirage: Transmitting a Video as a Perceptual Illusion for 50,000X Speedup

Junjie Wu, Tianrui Li, Yi Zhang, Ziyuan Yang

TL;DR

This work rethinks video transmission by discarding pixel-level data in favor of compact semantic cues that drive receiver-side generative synthesis. Mirage splits a video into temporal captions and spatial keyframes, transmits them over a semantic communication channel, and reconstructs the video with a diffusion-based generator guided by personalized prompts and anchors. The approach sharply reduces data volume and latency (a data-level speedup of up to $5.18\times 10^4$ in the reported scenarios) while preserving semantic consistency, which enables privacy-preserving and customizable video delivery. By combining sender, network, and receiver personalization with end-to-end semantic representation and generation, Mirage offers a scalable path toward efficient, privacy-respecting video transmission in future networks.

Abstract

Existing communication frameworks mainly aim to reconstruct source signals accurately in order to ensure reliable transmission. However, this signal-level, fidelity-oriented design often incurs high communication overhead and system complexity, particularly in video communication, where mainstream frameworks transmit the visual data itself and therefore consume significant bandwidth. To address this issue, we propose Mirage, a visual-data-free communication framework for extremely efficient video transmission that preserves semantic information. Mirage decomposes video content into two complementary components: temporal sequence information capturing motion dynamics, and spatial appearance representations describing overall visual structure. Temporal information is preserved through video captioning, while key frames are encoded into compact semantic representations of spatial appearance. These representations are transmitted to the receiver, where videos are synthesized using generative video models. Since no raw visual data is transmitted, Mirage is inherently privacy-preserving. Mirage also supports personalized adaptation across deployment scenarios: the sender, network, and receiver can independently impose constraints on semantic representation, transmission, and generation, enabling flexible trade-offs between efficiency, privacy, control, and perceptual quality. Experimental results in video transmission demonstrate that Mirage achieves up to a 50,000X data-level compression speedup over raw video transmission, with gains expected to grow as video content size increases.
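To make the data-level comparison in the abstract concrete, the following is a minimal sketch of how a raw-versus-semantic size ratio could be computed. It is not the authors' implementation: the `SemanticPayload` class, its field names, the helper functions, and every number in the usage lines are hypothetical, and the ratio it yields is illustrative rather than a reproduction of the reported figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SemanticPayload:
    """Hypothetical container for what a sender transmits instead of pixels."""
    caption: str                 # textual description of the temporal dynamics
    keyframe_codes: List[bytes]  # compact encodings of selected key frames

    def size_bytes(self) -> int:
        # Payload size = UTF-8 caption bytes + all key-frame code bytes.
        caption_size = len(self.caption.encode("utf-8"))
        keyframe_size = sum(len(code) for code in self.keyframe_codes)
        return caption_size + keyframe_size


def raw_video_bytes(width: int, height: int, fps: int,
                    seconds: int, channels: int = 3) -> int:
    """Size of uncompressed video at 8 bits per channel (an assumed baseline)."""
    return width * height * channels * fps * seconds


def data_ratio(raw_bytes: int, payload: SemanticPayload) -> float:
    """How many times smaller the semantic payload is than the raw video."""
    return raw_bytes / payload.size_bytes()


# Illustrative values only: a 5-second 720p clip at 30 fps, a 200-character
# caption, and two 1 KiB key-frame codes.
payload = SemanticPayload(caption="a" * 200, keyframe_codes=[bytes(1024)] * 2)
raw = raw_video_bytes(width=1280, height=720, fps=30, seconds=5)
ratio = data_ratio(raw, payload)
```

In this sketch the payload size does not depend on resolution or frame rate, so the ratio grows with the raw video size, which matches the abstract's expectation that gains scale with larger video content.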

Paper Structure (30 sections, 19 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Overview of the Mirage architecture. Mirage consists of three components: a sender-side video understanding module that converts raw video into semantic representations, a network-side semantic communication module for efficient transmission, and a receiver-side video generation module for personalized synthesis. The figure illustrates the roles of the sender, the network, and the receiver, where semantic representation, adaptive transmission, and receiver-side generation together enable communication-efficient and personalized video delivery without transmitting visual data.
  • Figure 2: Sender-Side Video Understanding via Semantic Decomposition.
  • Figure 3: Semantic communication in Mirage. Instead of transmitting visual data, the sender converts video content into compact semantic representations consisting of textual descriptions and key-frame representations.
  • Figure 4: Receiver-side personalized generative reconstruction in Mirage. Upon receiving semantic payloads, including decoded keyframe semantics and textual prompts, the receiver performs semantic-conditioned video generation instead of signal-level decoding. Reconstructed keyframes are optionally augmented to improve robustness and diversity, while textual prompts can be adapted according to receiver-side preferences.
  • Figure 5: Raw Video Transmission under Varying Wireless Channel Conditions. Box plots summarize the distribution across video samples, where circles indicate outliers corresponding to rare channel realizations or severe reconstruction failures.
  • ...and 5 more figures