Over-the-Air Semantic Alignment with Stacked Intelligent Metasurfaces
Mario Edoardo Pandolfo, Kyriakos Stylianopoulos, George C. Alexandropoulos, Paolo Di Lorenzo
TL;DR
This work tackles latent-space misalignment in semantic communications between heterogeneous encoders by proposing an over-the-air (OTA) solution that uses stacked intelligent metasurfaces (SIM) to perform semantic alignment in the wave domain. It introduces a gradient-based EM optimization framework that tunes the SIM transfer function to emulate both supervised linear semantic aligners and zero-shot Parseval-frame equalizers, enabling OTA interoperability. In numerical experiments with ViT encoders on CIFAR-10, larger SIMs reach task accuracy of up to about 90% at high SNR, and PPFE-based aligners are more robust at low SNR. The study provides practical guidelines on SIM depth, layer size, and inter-layer spacing, highlighting SIMs as a promising, energy-efficient building block for AI-native semantic communications.
Abstract
Semantic communication systems aim to transmit task-relevant information between devices capable of artificial intelligence, but their performance can degrade when heterogeneous transmitter-receiver models produce misaligned latent representations. Existing semantic alignment methods typically rely on additional digital processing at the transmitter or receiver, increasing overall device complexity. In this work, we introduce the first over-the-air semantic alignment framework based on stacked intelligent metasurfaces (SIM), which enables latent-space alignment directly in the wave domain and substantially reduces the computational burden at the device level. We model SIMs as trainable linear operators capable of emulating both supervised linear aligners and zero-shot Parseval-frame-based equalizers. To realize these operators physically, we develop a gradient-based optimization procedure that tailors the metasurface transfer function to a desired semantic mapping. Experiments with heterogeneous vision transformer (ViT) encoders show that SIMs can accurately reproduce both supervised and zero-shot semantic equalizers, achieving up to 90% task accuracy in regimes with high signal-to-noise ratio (SNR), while remaining robust at low SNR values.
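The idea of treating a SIM as a trainable linear operator and fitting it to a target mapping by gradient descent can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the procedure of this work: the cascade form G = Phi_L W_L ... Phi_2 W_2 Phi_1 with unit-modulus phase layers, the random complex matrices standing in for the physical inter-layer propagation coefficients, the random target matrix standing in for a semantic aligner, the Frobenius-norm loss, the backtracking step rule, and the hypothetical function names `sim_response`, `loss_and_grad`, and `fit_sim`.

```python
import numpy as np

def sim_response(thetas, props):
    """Cascade G = Phi_L W_L ... Phi_2 W_2 Phi_1, with Phi_l = diag(exp(j*theta_l))."""
    G = np.diag(np.exp(1j * thetas[0]))
    for theta, W in zip(thetas[1:], props):
        G = np.diag(np.exp(1j * theta)) @ W @ G
    return G

def loss_and_grad(thetas, props, target):
    """Squared Frobenius error ||G(theta) - target||^2 and its gradient w.r.t. the phases."""
    L, N = thetas.shape
    phis = np.exp(1j * thetas)
    # right[l]: product of all factors applied before layer l (G = left[l] Phi_l right[l]).
    right = [np.eye(N, dtype=complex)]
    for l in range(1, L):
        right.append(props[l - 1] @ np.diag(phis[l - 1]) @ right[l - 1])
    # left[l]: product of all factors applied after layer l.
    left = [None] * L
    left[L - 1] = np.eye(N, dtype=complex)
    for l in range(L - 2, -1, -1):
        left[l] = left[l + 1] @ np.diag(phis[l + 1]) @ props[l]
    E = left[0] @ np.diag(phis[0]) @ right[0] - target
    loss = float(np.sum(np.abs(E) ** 2))
    grad = np.empty_like(thetas)
    for l in range(L):
        M = left[l].conj().T @ E @ right[l].conj().T
        # d loss / d theta_{l,n} = 2 Re(j phi_n conj(M_nn)) = -2 Im(phi_n conj(M_nn))
        grad[l] = -2.0 * np.imag(phis[l] * np.conj(np.diag(M)))
    return loss, grad

def fit_sim(target, props, n_layers, steps=200, lr=0.1, seed=0):
    """Gradient descent on the phases with a simple backtracking step size."""
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(0.0, 2.0 * np.pi, size=(n_layers, target.shape[0]))
    loss, grad = loss_and_grad(thetas, props, target)
    history = [loss]
    for _ in range(steps):
        step = lr
        while step > 1e-12:
            cand = thetas - step * grad
            cand_loss, cand_grad = loss_and_grad(cand, props, target)
            if cand_loss < loss:  # accept only steps that reduce the loss
                thetas, loss, grad = cand, cand_loss, cand_grad
                break
            step *= 0.5
        history.append(loss)
    return thetas, history

# Toy setup (assumed): N meta-atoms per layer, L layers, random stand-in matrices.
rng = np.random.default_rng(0)
N, L = 8, 3
props = [(rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(2 * N)
         for _ in range(L - 1)]
target = (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(2 * N)
thetas, history = fit_sim(target, props, n_layers=L)
print(f"fit error: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because each layer only applies unit-modulus phase shifts, an arbitrary target is generally not reproduced exactly by this toy cascade. The sketch only shows that the fitting error can be driven down by gradient steps on the phases, and the residual would be expected to depend on the number of layers and elements per layer.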
