Semantic Communications with World Models

Peiwen Jiang, Jiajia Guo, Chao-Kai Wen, Shi Jin, Jun Zhang

TL;DR

This work addresses the high transmission overhead of semantic video communication under varying wireless channels by leveraging world foundation models (WFMs) to predict future frames from the current frame and textual guidance. The proposed framework integrates a depth-based feedback monitor, segmentation-assisted partial transmission, and an active trajectory-aware strategy to adaptively schedule transmissions, reducing bandwidth while preserving task-relevant semantics. Key contributions include a WFM-based prediction pipeline, full-transmission and partial-transmission modes with diffusion-model repair, a lightweight depth-feedback mechanism, and an LLM-driven active adaptation strategy. Simulations on KITTI demonstrate substantial bandwidth savings with competitive perceptual and task-level metrics, particularly when partial transmission (PartTr) and depth feedback are employed, and the active strategy further improves reliability in mobile scenarios. Overall, the paper shows that combining WFMs with context-aware control can enable robust, bandwidth-efficient semantic communications in dynamic wireless environments as foundation models continue to evolve.

Abstract

Semantic communication is a promising technique for emerging wireless applications, which reduces transmission overhead by transmitting only task-relevant features instead of raw data. However, existing methods struggle under extremely low bandwidth and varying channel conditions, where corrupted or missing semantics lead to severe reconstruction errors. To resolve this difficulty, we propose a world foundation model (WFM)-aided semantic video transmission framework that leverages the predictive capability of WFMs to generate future frames based on the current frame and textual guidance. This design allows transmissions to be omitted when predictions remain reliable, thereby saving bandwidth. Through the WFM's prediction, the key semantics are preserved, yet minor prediction errors tend to amplify over time. To mitigate this issue, a lightweight depth-based feedback module is introduced to determine whether transmission of the current frame is needed. Apart from transmitting the entire frame, a segmentation-assisted partial transmission method is proposed to repair degraded frames, which can further balance performance and bandwidth cost. Furthermore, an active transmission strategy is developed for mobile scenarios by exploiting camera trajectory information and proactively scheduling transmissions before channel quality deteriorates. Simulation results show that the proposed framework significantly reduces transmission overhead while maintaining task performance across varying scenarios and channel conditions.
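
The feedback-triggered scheduling described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: depth maps are flattened to plain lists of floats, the discrepancy measure is a mean absolute error, and the names `mean_abs_depth_error`, `schedule_transmissions`, `predict_next`, and `threshold` are all hypothetical. The `predict_next` callable stands in for the WFM and may be any function that rolls a depth map one frame forward.

```python
from typing import Callable, List, Sequence


def mean_abs_depth_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute difference between two flattened depth maps (assumed metric)."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def schedule_transmissions(
    true_depths: Sequence[Sequence[float]],
    predict_next: Callable[[List[float]], List[float]],
    threshold: float,
) -> List[bool]:
    """Return one decision per frame: True = transmit, False = rely on prediction.

    Assumption: the first frame is always transmitted so that the
    receiver has a starting point for prediction.
    """
    decisions = [True]
    current = list(true_depths[0])
    for ref in true_depths[1:]:
        predicted = predict_next(current)
        if mean_abs_depth_error(predicted, ref) > threshold:
            # Prediction has drifted too far: transmit and reset the receiver state.
            decisions.append(True)
            current = list(ref)
        else:
            # Prediction is still acceptable: skip transmission and keep predicting,
            # so any error carries over into the next step.
            decisions.append(False)
            current = predicted
    return decisions


# Toy usage: a static scene and a stand-in predictor that drifts by 0.1 per frame.
frames = [[1.0, 1.0] for _ in range(6)]
decisions = schedule_transmissions(frames, lambda d: [x + 0.1 for x in d], 0.25)
print(decisions)
```

In this toy setting the drift accumulates over skipped frames until it crosses the threshold, at which point a transmission resets it. This mirrors the abstract's observation that minor prediction errors amplify over time unless feedback triggers a refresh.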

Paper Structure

This paper contains 20 sections, 22 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Overview of the proposed framework.
  • Figure 2: (a) Architectures of two different transmission methods. (b) Detailed network of the related encoder and decoder.
  • Figure 3: Architecture of the proposed depth feedback, where the current depth map is fed back to decide whether transmission is required.
  • Figure 4: LLM-based active strategy generation.
  • Figure 5: Performance of the WFM-based prediction under different transmission content with a fixed interval. (a) LPIPS metric. (b) Depth metric.
  • ...and 6 more figures