Cross-Modal Generative Semantic Communications for Mobile AIGC: Joint Semantic Encoding and Prompt Engineering

Yinqiu Liu; Hongyang Du; Dusit Niyato; Jiawen Kang; Zehui Xiong; Shiwen Mao; Ping Zhang; Xuemin Shen

TL;DR

This work tackles the bandwidth bottleneck in mobile AIGC by introducing cross-modal Generative Semantic Communications (G-SemCom) that transmit only semantically critical content. It leverages cross-modal attention maps from diffusion-based AIGC to perform textual and visual prompt deconstruction, dependency-aware semantic extraction, and DBSCAN-based visual segmentation, enabling high-quality recovery with reduced data transfer. A novel Joint Perceptual Similarity and Quality (JPSQ) metric guides a joint semantic encoding and prompt engineering formulation, solved by Attention-aware Deep Diffusion (ADD) to optimize bandwidth allocation. Experimental results show about 49.4% average bandwidth reduction with minimal perceptual quality loss (NIMA drop ~0.03) and a 1.74x improvement in overall reward for ADD over baselines, underscoring the practical potential for bandwidth-efficient mobile AIGC services.

Abstract

By employing numerous Mobile AI-Generated Content (AIGC) Service Providers (MASPs) with powerful models, high-quality AIGC services can become accessible to resource-constrained end users. However, this advancement, referred to as mobile AIGC, also introduces a significant challenge: users must download large AIGC outputs from the MASPs, leading to substantial bandwidth consumption and potential transmission failures. In this paper, we apply cross-modal Generative Semantic Communications (G-SemCom) in mobile AIGC to overcome wireless bandwidth constraints. Specifically, we utilize a series of cross-modal attention maps to indicate the correlation between user prompts and each part of the AIGC outputs. In this way, the MASP can analyze the prompt context and efficiently filter the most semantically important content. Only this semantic information is transmitted, from which users can recover the entire AIGC output with high quality while saving mobile bandwidth. Since the transmitted information not only preserves the semantics but also prompts the recovery, we formulate a joint semantic encoding and prompt engineering problem to optimize the bandwidth allocation among users. In particular, we present a human-perceptual metric named Joint Perceptual Similarity and Quality (JPSQ), which fuses two learning-based measurements of semantic similarity and aesthetic quality, respectively. Furthermore, we develop the Attention-aware Deep Diffusion (ADD) algorithm, which learns attention maps and leverages the diffusion process to enhance the environment exploration ability. Extensive experiments demonstrate that our proposal can reduce the bandwidth consumption of mobile users by 49.4% on average, with almost no perceptual difference in AIGC output quality. Moreover, the ADD algorithm outperforms baseline DRL methods, achieving a 1.74x higher overall reward.
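To make the attention-based filtering more concrete, the following is a minimal Python sketch of how a single attention map might be binarized and clustered with DBSCAN so that only dense, high-attention regions are kept and isolated pixels are discarded as noise. It is an illustration under stated assumptions, not the authors' implementation: the function names (`binarize`, `dbscan`, `segment_attention`), the threshold, `eps`, `min_pts`, and `min_cluster_size` values, and the toy 6x6 map are all hypothetical, and the clustering is a plain textbook DBSCAN written without external libraries.

```python
from collections import deque


def binarize(attention, threshold):
    """Return (row, col) coordinates whose attention score is >= threshold."""
    return [(r, c)
            for r, row in enumerate(attention)
            for c, score in enumerate(row)
            if score >= threshold]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN on 2-D points. Returns one label per point; -1 is noise."""
    eps_sq = eps * eps

    def neighbors(i):
        pr, pc = points[i]
        return [j for j, (qr, qc) in enumerate(points)
                if (pr - qr) ** 2 + (pc - qc) ** 2 <= eps_sq]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points keep expanding the cluster
                queue.extend(nbrs)
    return labels


def segment_attention(attention, threshold=0.5, eps=1.5, min_pts=3,
                      min_cluster_size=4):
    """Keep dense high-attention regions; drop noise and very small clusters."""
    points = binarize(attention, threshold)
    labels = dbscan(points, eps, min_pts)
    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    return [sorted(pts) for pts in clusters.values()
            if len(pts) >= min_cluster_size]


# Hypothetical 6x6 attention map: a 3x3 salient block plus one stray pixel.
toy_map = [[0.0] * 6 for _ in range(6)]
for r in range(3):
    for c in range(3):
        toy_map[r][c] = 0.9
toy_map[5][5] = 0.7

segments = segment_attention(toy_map)
print(len(segments), "segment(s) kept")
```

With `eps=1.5`, each pixel is compared against its eight immediate neighbours, so the 3x3 block forms a single cluster while the stray pixel has too few neighbours to be a core point and is labelled noise. In a real pipeline, the surviving pixel sets would serve as candidate regions to transmit; how they are encoded and how bandwidth is allocated among them is outside the scope of this sketch.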

Paper Structure (40 sections, 21 equations, 13 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Mobile AIGC
Generative Semantic Communications (G-SemCom)
Prompt Engineering
Motivation and System Model
Motivation
System Model
Transmission Model
Cross-Modal G-SemCom for Mobile AIGC
Source Image Generation
Cross-Modal Attention Map
Attention-Aware Semantic Extraction
Textual Prompt Deconstruction
Visual Prompt Segmentation
...and 25 more sections

Figures (13)

Figure 1: Local AIGC (left) and Traditional Mobile AIGC (right).
Figure 2: The system model. Step 1: Cross-modal attention map generation, Step 2: Attention-aware semantic extraction, and Step 3: Generative decoding.
Figure 3: The illustration of generating source images and cross-modal attention maps. (a): The CLIP and VAE modules. (b): The UNet architecture and diffusion process. (c): The attention map of word [car].
Figure 4: Attention-aware semantic extraction. (a): Dependency parsing. (b): Boolean dependency matrices $\boldsymbol{C}$ and $\boldsymbol{C}^{*}$. (c): Dependency level matrix $\boldsymbol{D}^{*}$ and $\otimes$ operation.
Figure 5: Illustration of attention clustering. (a): The raw and binary attention maps. (b): The illustration of the DBSCAN algorithm. (c): The clustering results. Note that noise and noisy clusters will be filtered out.
...and 8 more figures
