Over-the-Air Semantic Alignment with Stacked Intelligent Metasurfaces

Mario Edoardo Pandolfo, Kyriakos Stylianopoulos, George C. Alexandropoulos, Paolo Di Lorenzo

TL;DR

This work tackles latent-space misalignment in semantic communications between heterogeneous encoders by proposing an over-the-air (OTA) solution that uses stacked intelligent metasurfaces (SIM) to perform semantic alignment in the wave domain. It introduces a gradient-based EM optimization framework that tunes the SIM transfer function to emulate both supervised linear semantic aligners and zero-shot Parseval-frame equalizers, enabling OTA interoperability. Through numerical experiments with ViT encoders on CIFAR-10, the authors show that larger SIMs yield high task accuracy (up to ~90%) at high SNR, with the zero-shot, PPFE-based aligners offering greater robustness at low SNR. The study provides practical guidelines on SIM depth, layer size, and inter-layer spacing, highlighting SIMs as a promising, energy-efficient building block for AI-native semantic communications.

Abstract

Semantic communication systems aim to transmit task-relevant information between devices capable of artificial intelligence, but their performance can degrade when heterogeneous transmitter-receiver models produce misaligned latent representations. Existing semantic alignment methods typically rely on additional digital processing at the transmitter or receiver, increasing overall device complexity. In this work, we introduce the first over-the-air semantic alignment framework based on stacked intelligent metasurfaces (SIM), which enables latent-space alignment directly in the wave domain, substantially reducing the computational burden at the device level. We model SIMs as trainable linear operators capable of emulating both supervised linear aligners and zero-shot Parseval-frame-based equalizers. To realize these operators physically, we develop a gradient-based optimization procedure that tailors the metasurface transfer function to a desired semantic mapping. Experiments with heterogeneous vision transformer (ViT) encoders show that SIMs can accurately reproduce both supervised and zero-shot semantic equalizers, achieving up to 90% task accuracy in regimes with high signal-to-noise ratio (SNR), while maintaining strong robustness even at low SNR values.
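
The idea of fitting a metasurface transfer function to a desired linear mapping can be illustrated with a small NumPy sketch. It is not the authors' algorithm. It assumes a simplified SIM model in which each layer applies a diagonal phase shift and fixed matrices describe propagation between layers, and it minimizes a plain Frobenius-norm mismatch with gradient descent and step halving. All names (`sim_response`, `grad`, `fit`, `make_problem`) and the random propagation matrices are hypothetical placeholders for illustration.

```python
import numpy as np

def sim_response(thetas, props):
    """Assumed cascade model: G = Phi_L W_{L-1} ... W_1 Phi_1, Phi_l = diag(exp(j*theta_l))."""
    G = np.diag(np.exp(1j * thetas[0]))
    for theta, W in zip(thetas[1:], props):
        G = np.diag(np.exp(1j * theta)) @ W @ G
    return G

def loss(thetas, props, target):
    """Squared Frobenius distance between the SIM response and the target mapping."""
    return np.linalg.norm(sim_response(thetas, props) - target) ** 2

def grad(thetas, props, target):
    """Analytic gradient of the loss with respect to every phase shift."""
    L, N = thetas.shape
    phis = [np.diag(np.exp(1j * t)) for t in thetas]
    # before[l]: product of everything applied before layer l's phase shift.
    before = [np.eye(N, dtype=complex)]
    for l in range(1, L):
        before.append(props[l - 1] @ phis[l - 1] @ before[l - 1])
    # after[l]: product of everything applied after layer l's phase shift.
    after = [None] * L
    after[L - 1] = np.eye(N, dtype=complex)
    for l in range(L - 2, -1, -1):
        after[l] = after[l + 1] @ phis[l + 1] @ props[l]
    E = after[0] @ phis[0] @ before[0] - target  # residual G - target
    g = np.empty_like(thetas)
    for l in range(L):
        d = np.diag(before[l] @ E.conj().T @ after[l])
        g[l] = -2.0 * np.imag(np.exp(1j * thetas[l]) * d)
    return g

def fit(thetas, props, target, steps=200, lr=0.1):
    """Gradient descent on the phases; the step is halved until the loss decreases."""
    thetas = thetas.copy()
    current = loss(thetas, props, target)
    for _ in range(steps):
        g = grad(thetas, props, target)
        step = lr
        while step > 1e-12:
            trial = thetas - step * g
            trial_loss = loss(trial, props, target)
            if trial_loss < current:
                thetas, current = trial, trial_loss
                break
            step *= 0.5
        else:
            break  # no improving step found; stop early
    return thetas, current

def make_problem(n=4, layers=3, seed=0):
    """Toy instance: random propagation matrices and a target that some SIM can realize."""
    rng = np.random.default_rng(seed)
    props = [(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
             / np.sqrt(2 * n) for _ in range(layers - 1)]
    target = sim_response(rng.uniform(0, 2 * np.pi, (layers, n)), props)
    init = rng.uniform(0, 2 * np.pi, (layers, n))
    return init, props, target
```

Calling `fit(*make_problem())` returns the tuned phases and the remaining mismatch. In a physical design the random matrices would be replaced by a propagation model that depends on the layer geometry, and the target would be the semantic aligner to emulate. Both are outside the scope of this sketch.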

Paper Structure

This paper contains 9 sections, 14 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: The proposed SC model: The SIM module performs OTA semantic equalization on the complex, compressed, and pre-whitened latent representation of TX before the transmission through a MIMO channel $\mathbf{H}$ with noise $\mathbf{v}$. At the RX side, channel equalization is first applied followed by decoding, in which the received signal is re-colored and decompressed to recover the message in the original RX latent space representation.
  • Figure 2: Accuracy versus $L$, considering infinite ${\text{SNR}_\text{[dB]}}$.
  • Figure 3: Accuracy versus ${\text{SNR}_\text{[dB]}}$, considering $L=10$.
  • Figure 4: Accuracy versus $s_{\mathrm{layer}}$, considering only PPFE alignment.