Cross-Attention Transformer for Joint Multi-Receiver Uplink Neural Decoding
Xavier Tardy, Grégoire Lefebvre, Apostolos Kountouris, Haïfa Fares, Amor Nafkha
TL;DR
This work tackles multi-AP uplink decoding for OFDM by introducing a cross-attention Transformer that jointly processes observations from multiple coordinated APs. Each AP's received grid is first encoded by a shared self-attention encoder; a token-wise, anchor-based cross-attention module then fuses the AP views to produce per-receiver soft information in the form of log-likelihood ratios $L_i$, without requiring explicit per-AP CSI. Trained with a Bit-Metric Decoding objective, the model learns data-dependent fusion that adapts to per-AP reliability and remains robust under missing links and pilot sparsity. On realistic 3GPP TR 38.901 UMi channels, it achieves BER gains over LS/LMMSE and CNN baselines and approaches or surpasses a perfect-CSI reference in higher cooperation regimes. The model is compact ($0.15$M parameters, $0.24$ GFLOPs) and offers low latency on GPUs, which makes it a practical, scalable building block for cooperative uplink reception in next-generation Wi-Fi receivers, including Wi-Fi 7/8 deployments.
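To make the fusion step concrete, the following is a minimal sketch of token-wise cross-attention across APs, under several assumptions that are not taken from the paper: the "anchor" is read as the AP whose embedding supplies the query, the learned query/key/value projections are replaced by the identity, a single head is used, and a missing link is modelled by masking its score to $-\infty$. The names `fuse_token` and `available` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse_token(anchor_vec, ap_vecs, available):
    """Fuse one token (one resource element) across APs.

    anchor_vec: embedding of the anchor AP at this token (used as the query).
    ap_vecs:    embeddings of all APs at the same token (keys and values).
    available:  booleans; False marks a missing link (assumed masking scheme).
    The anchor itself is assumed to be available, so at least one score is finite.
    """
    d = len(anchor_vec)
    scores = [
        dot(anchor_vec, k) / math.sqrt(d) if ok else float("-inf")
        for k, ok in zip(ap_vecs, available)
    ]
    weights = softmax(scores)  # attention over APs, not over time-frequency tokens
    fused = [sum(w * v[j] for w, v in zip(weights, ap_vecs)) for j in range(d)]
    return fused, weights

# Illustrative values only: three APs, the third link is missing.
anchor = [1.0, 0.0]
ap_vecs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
available = [True, True, False]
fused, weights = fuse_token(anchor, ap_vecs, available)
```

In this sketch the attention weights sum to one over the available APs and the masked AP contributes nothing, which is one simple way a data-dependent fusion could tolerate missing links. A trained model would additionally apply learned projections before the dot product and map the fused token to LLRs.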
Abstract
We propose a cross-attention Transformer for joint decoding of uplink OFDM signals received by multiple coordinated access points. A shared per-receiver encoder learns time-frequency structure within each received grid, and a token-wise cross-attention module fuses the receivers to produce soft log-likelihood ratios for a standard channel decoder, without requiring explicit per-receiver channel estimates. Trained with a bit-metric objective, the model adapts its fusion to per-receiver reliability, tolerates missing or degraded links, and remains robust when pilots are sparse. Across realistic Wi-Fi channels, it consistently outperforms classical pipelines and strong convolutional baselines, frequently matching (and in some cases surpassing) a powerful baseline that assumes perfect channel knowledge per access point. Despite its expressiveness, the architecture is compact ($0.15$M parameters), has low computational cost ($0.24$ GFLOPs), and achieves low latency on GPUs, making it a practical building block for next-generation Wi-Fi receivers.
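As an illustration of what a bit-metric objective typically looks like, the sketch below computes the mean binary cross-entropy between transmitted bits and predicted LLRs, together with the corresponding bit-metric decoding rate per bit. It assumes the sign convention $L = \log \frac{P(b=1)}{P(b=0)}$, which the text does not specify, and the function names `bmd_loss` and `bmd_rate` are hypothetical; the paper's exact loss may differ.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bmd_loss(bits, llrs):
    """Mean binary cross-entropy in nats.

    Assumes LLR = log P(b=1) / P(b=0), so P(b=1) = sigmoid(LLR) and the
    per-bit loss is log(1 + exp(-(2b - 1) * LLR)).
    """
    return sum(softplus(-(2 * b - 1) * l) for b, l in zip(bits, llrs)) / len(bits)

def bmd_rate(bits, llrs):
    """Bit-metric decoding rate estimate in bits per coded bit: 1 - BCE / ln 2."""
    return 1.0 - bmd_loss(bits, llrs) / math.log(2.0)

# Illustrative values only.
uninformative = bmd_loss([0, 1], [0.0, 0.0])   # LLR = 0 carries no information
confident = bmd_loss([1, 0], [20.0, -20.0])    # correct and confident
wrong = bmd_loss([0], [20.0])                  # confident but wrong
```

Under this convention, minimizing the loss is equivalent to maximizing the rate estimate, and a confidently wrong LLR is penalized roughly linearly in its magnitude.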
