Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing

Bin Chen, Wenbo Yu, Qinshan Zhang, Tianqu Zhuang, Yong Jiang, Shu-Tao Xia

TL;DR

Editable-DeepSC addresses the challenge of real-time cross-modal facial editing over noisy channels by integrating editing operations into the semantic communication pipeline. It combines GAN-inversion-based semantic coding with Joint Editing-Channel Coding and lightweight SNR-aware adapters to transmit only task-relevant facial semantics while enabling precise, user-guided edits. The method achieves superior editing fidelity and semantic preservation while dramatically reducing the Channel Bandwidth Ratio ($\rho$) compared to baselines, including under high-resolution ($1024\times1024$) and Out-Of-Distribution (OOD) settings. This approach enables efficient, interactive, language-guided facial editing over wireless links, with practical implications for real-time social-media and metaverse applications.

Abstract

Real-time computer vision (CV) plays a crucial role in many real-world applications, and its performance depends heavily on the underlying communication networks. However, the data-oriented design of conventional communications often does not match the needs of real-time CV tasks. Semantic communications, which transmit only task-related semantic information, have recently emerged as a promising way to address this mismatch. Yet the communication challenges of Semantic Facial Editing, one of the most important real-time CV applications on social media, remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. First, we theoretically discuss transmission schemes that handle communication and editing separately, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attribute matching, which integrates editing into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we use inversion methods based on pre-trained StyleGAN priors for semantic coding. To handle dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC achieves superior editing quality while significantly reducing transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.

Paper Structure

This paper contains 19 sections, 1 theorem, 18 equations, 11 figures, 6 tables.

Key Result

Theorem 1

(Data Processing Inequality) If $U \rightarrow V \rightarrow W$, then $\mathcal{I}(U;V) \geqslant \mathcal{I}(U;W)$.
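The inequality can be checked numerically on a toy Markov chain. The sketch below is only an illustration and is not taken from the paper: it assumes a uniform binary source $U$ and two cascaded binary symmetric channels with hypothetical crossover probabilities 0.1 and 0.2, and the helper names (`mutual_information`, `joint_through_channel`, `compose`, `bsc`) are made up for this example.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution given as joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi


def joint_through_channel(p_in, channel):
    """Joint p(a, b) = p(a) * channel[a][b] for a row-stochastic channel matrix."""
    return [[p_in[a] * p_ab for p_ab in channel[a]] for a in range(len(p_in))]


def compose(first, second):
    """Channel U -> W obtained by chaining U -> V (first) and V -> W (second)."""
    return [
        [sum(first[u][v] * second[v][w] for v in range(len(second)))
         for w in range(len(second[0]))]
        for u in range(len(first))
    ]


def bsc(eps):
    """Binary symmetric channel with crossover probability eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]


# Hypothetical toy setup: uniform binary U, then two noisy processing steps.
p_u = [0.5, 0.5]
u_to_v = bsc(0.1)
v_to_w = bsc(0.2)
u_to_w = compose(u_to_v, v_to_w)

i_uv = mutual_information(joint_through_channel(p_u, u_to_v))
i_uw = mutual_information(joint_through_channel(p_u, u_to_w))

print(f"I(U;V) = {i_uv:.4f} bits, I(U;W) = {i_uw:.4f} bits")
print("Data processing inequality holds:", i_uv >= i_uw)
```

Each additional processing step after $V$ can only keep or reduce the information about $U$, which is the property the abstract appeals to when it argues that handling communication and editing as separate stages preserves less semantic mutual information than coding them jointly.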

Figures (11)

  • Figure 1: Illustration of the dynamic semantic facial editing scenarios. During the transmission, users may wish to flexibly edit the original multimedia data according to their personal needs in a conversational and interactive way.
  • Figure 2: Overview of the proposed framework. Editable-DeepSC mainly consists of the Text Semantic Encoder, the Image Semantic Encoder, the Joint Editing-Channel Encoder, and the Joint Semantic-Channel Decoder, where channel noise corruptions from the real world are also taken into consideration.
  • Figure 3: Illustration of our SNR-aware channel coding based on model fine-tuning. We introduce two lightweight trainable adapters, which do not change the shapes of inputs and outputs, into the Joint Editing-Channel Encoder and the Joint Semantic-Channel Decoder. During fine-tuning, only the parameters of these two adapters are adjusted to capture the distribution of the new noise conditions; the remaining parameters are frozen to avoid forgetting the priors previously learned in the Joint Editing-Channel Coding (JECC) stage.
  • Figure 4: Quantitative comparison of different methods on the CelebA dataset (resolution $128 \times 128$) for cross-modal language-driven image editing and transmission tasks. Note that $\downarrow$ indicates that the lower the metric, the better the performance, while $\uparrow$ indicates that the higher the metric, the better the performance.
  • Figure 5: Qualitative comparison of different methods on the CelebA dataset (resolution $128 \times 128$) for cross-modal language-driven image editing and transmission tasks at an SNR of $6$ dB. The instructive sentences for the first and second rows are "I kind of want the smile to be less obvious" and "Smile more", respectively.
  • ...and 6 more figures
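The adapter idea described in the Figure 3 caption can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's architecture: the caption only says the adapters are lightweight, shape-preserving, and the sole trainable parts, so the residual bottleneck design, the zero initialization, the placement after a single frozen layer, and every name below (`FrozenLayer`, `Adapter`, `forward`) are hypothetical choices made for illustration.

```python
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in matrix]


class FrozenLayer:
    """Stand-in for a pre-trained layer whose weights stay fixed during fine-tuning."""

    def __init__(self, dim, rng):
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.trainable = False

    def __call__(self, x):
        return matvec(self.weight, x)


class Adapter:
    """Assumed residual bottleneck adapter: x + up(relu(down(x))).

    The output has the same length as the input, so it can be inserted
    without changing the shapes seen by the surrounding layers.
    """

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        # Zero-initialized up-projection: the adapter starts as an identity map.
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.trainable = True

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return [xi + di for xi, di in zip(x, matvec(self.up, hidden))]


def forward(modules, x):
    """Apply the modules in order."""
    for module in modules:
        x = module(x)
    return x


rng = random.Random(0)
dim = 8
encoder = [FrozenLayer(dim, rng), Adapter(dim, bottleneck=2, rng=rng)]

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
base_out = encoder[0](x)
full_out = forward(encoder, x)

# Only modules flagged as trainable would receive updates during fine-tuning.
trainable = [m for m in encoder if m.trainable]
print("output length:", len(full_out))
print("trainable modules:", [type(m).__name__ for m in trainable])
```

With the zero-initialized up-projection, the adapted model reproduces the frozen model's output before any fine-tuning, so adapting to a new noise condition starts from the previously learned behavior rather than from scratch.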

Theorems & Definitions (1)

  • Theorem 1