Real-Time and Scalable Zak-OTFS Receiver Processing on GPUs

Junyao Zheng, Chung-Hsuan Tung, Yuncheng Yao, Nishant Mehrotra, Sandesh Mattu, Zhenzhou Qi, Danyang Zhuo, Robert Calderbank, Tingjun Chen

Abstract

Orthogonal time frequency space (OTFS) modulation offers superior robustness to high-mobility channels compared to conventional orthogonal frequency-division multiplexing (OFDM) waveforms. However, its explicit delay-Doppler (DD) domain representation incurs substantial signal processing complexity, especially with increased DD domain grid sizes. To address this challenge, we present a scalable, real-time Zak-OTFS receiver architecture on GPUs through hardware--algorithm co-design that exploits DD-domain channel sparsity. Our design leverages compact matrix operations for key processing stages, a branchless iterative equalizer, and a structured sparse representation of the DD-domain channel matrix to significantly reduce computational and memory overhead. These optimizations enable low-latency processing that consistently meets the 99.9th-percentile real-time processing deadline. The proposed system achieves up to 906.52 Mbps throughput with a DD grid size of (16384, 32) using 16QAM modulation over 245.76 MHz bandwidth. Extensive evaluations under a Vehicular-A channel model demonstrate strong scalability and robust performance across CPU (Intel Xeon) and multiple GPU platforms (NVIDIA Jetson Orin, RTX 6000 Ada, A100, and H200), highlighting the effectiveness of compute-aware Zak-OTFS receiver design for next-generation (NextG) high-mobility communication systems.

Paper Structure

This paper contains 28 sections, 41 equations, 15 figures, 4 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Zak-OTFS signal processing pipeline based on discrete Zak transform (DZT). A point-pilot frame and a data frame are concatenated to form a packet. The simulated channel considers Veh-A itur_m1225 for paths with delay and Doppler shifts, and AWGN for SNR adjustment. The receiver pipeline stages include DZT, channel estimation, equalization, and demodulation, which will be discussed in Section \ref{subsec:prelim_pipeline}.
  • Figure 2: Measured compute latency and scalability on $\textbf{A} \in \mathbb{C}^{{N_\text{d}}\times{N_\text{d}}}$ across CPU and GPU platforms, including (a) matrix inversion, and (b) matrix-vector multiplication (MVM), with varying matrix dimension $N_\text{d}$. GPU-based matrix operations achieve lower latency and better scalability for ${N_\text{d}} > 256$.
  • Figure 3: The $\widehat{\textbf{h}}_{\textrm{eff}}$ and $\widehat{\textbf{H}}_{\textrm{dd}}$ magnitudes for $(M,N)=(8,2)$. The binarized versions depict the dominant entries using a threshold of $\theta = 0.12$, where any entry with magnitude below the threshold is set to zero. (a) The significant bins in $\widehat{\textbf{h}}_{\textrm{eff}}$ correspond to the dominant channel paths. (b) The channel matrix $\widehat{\textbf{H}}_{\textrm{dd}}$ is divided into $N\times N$ smaller sub-matrices of dimension $M\times M$. (c) Thresholding isolates the most significant paths in $\widehat{\textbf{h}}_{\textrm{eff}}$, with each path at $({k}_{p},{l}_{p})$. In this example, two paths are located at (4, 1) in blue and at (0, 0) in orange. The blue path is associated with zero delay and zero Doppler shift, while the orange path corresponds to a delay shift of 0.017 ms and a Doppler shift of 15 kHz. (d) The resultant patterns in $\widehat{\textbf{H}}_{\textrm{dd}}$, where the colors associate the dominant paths in the binarized $\widehat{\textbf{h}}_{\textrm{eff}}$ and $\widehat{\textbf{H}}_{\textrm{dd}}$. For each row $q$ in $\widehat{\textbf{H}}_{\textrm{dd}}$, the dominant path $p$ maps to column ${r}_{p}(q)$ as a function of $({k}_{p},{l}_{p})$, per \ref{eq:map_chMatDD_col}.
  • Figure 4: BER and ${c_\textrm{norm}}$ vs. iterations in CGA for $(M,N)=(128,32)$, 16QAM. Across iterations, ${c_\textrm{norm}}$ exhibits monotonic decay with a diminishing slope, whereas BER behaves more variably, including flat (SNR = 0 dB) and rebounding (SNR = 30 dB) trends. In addition, ${c_\textrm{norm}}$ scales exponentially with SNR, a dependence that BER does not exhibit.
  • Figure 5: Median end-to-end processing latency across hardware platforms (CPU, GPU), equalizers (LMMSE, MRC, and CGA), and structured-sparsity (SS) awareness, with $N=32$. The pilot/data frame duration is $T=1.067$ ms and the processing deadline is $2T=2.134$ ms.
  • ...and 10 more figures
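
The structured-sparsity step illustrated in Figure 3 (a)–(c) can be sketched as follows. This is a minimal, hedged example: it binarizes an effective channel estimate with the magnitude threshold $\theta = 0.12$ from the figure and reads off the dominant path locations $(k_p, l_p)$. The grid size $(M, N) = (8, 2)$ and the two path locations (4, 1) and (0, 0) match the caption, but the `h_eff` values below are synthetic placeholders, not the paper's data.

```python
import numpy as np

# Grid size and threshold taken from the Figure 3 caption.
M, N = 8, 2
theta = 0.12

# Synthetic effective channel estimate: weak background entries well below
# the threshold, plus two strong paths at the locations used in the figure.
rng = np.random.default_rng(0)
h_eff = 0.02 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
h_eff[4, 1] = 1.0          # strong path at (k_p, l_p) = (4, 1)
h_eff[0, 0] = 0.6 + 0.3j   # strong path at (k_p, l_p) = (0, 0)

# Binarize: keep only entries whose magnitude reaches the threshold.
mask = np.abs(h_eff) >= theta
dominant_paths = sorted((int(i), int(j)) for i, j in zip(*np.nonzero(mask)))
# dominant_paths -> [(0, 0), (4, 1)]
```

Only the surviving entries of the binarized mask then need to be tracked when building the structured sparse $\widehat{\textbf{H}}_{\textrm{dd}}$, which is what reduces the memory and compute footprint.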
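
Figure 4's ${c_\textrm{norm}}$ behavior can be reproduced in spirit with a generic conjugate-gradient equalizer. This is a sketch under stated assumptions, not the paper's CGA: we assume a regularized normal-equation system $(\textbf{H}^{\mathsf{H}}\textbf{H} + \lambda \textbf{I})\,\textbf{x} = \textbf{H}^{\mathsf{H}}\textbf{y}$ and take ${c_\textrm{norm}}$ to be the normalized residual, which decays toward zero across iterations as the figure shows; the function name, $\lambda$, and iteration count are all hypothetical.

```python
import numpy as np

def cg_equalize(H, y, lam=1e-2, iters=20):
    """Conjugate-gradient solve of (H^H H + lam*I) x = H^H y.

    Returns the estimate x and a per-iteration list of normalized
    residual norms ||r|| / ||b|| (a stand-in for c_norm).
    """
    A = H.conj().T @ H + lam * np.eye(H.shape[1])
    b = H.conj().T @ y
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    norms = []
    for _ in range(iters):
        Ap = A @ p
        alpha = (r.conj() @ r) / (p.conj() @ Ap)   # step size
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new.conj() @ r_new) / (r.conj() @ r)
        p = r_new + beta * p                       # conjugate direction update
        r = r_new
        norms.append(float(np.linalg.norm(r) / np.linalg.norm(b)))
    return x, norms

# Small synthetic system for illustration.
rng = np.random.default_rng(1)
H = rng.standard_normal((16, 8)) + 1j * rng.standard_normal((16, 8))
y = rng.standard_normal(16) + 1j * rng.standard_normal(16)
x_hat, c_norms = cg_equalize(H, y)
```

Because CG is branchless in its inner loop (fixed sequence of matrix-vector products and axpy updates), a fixed iteration budget like this maps naturally onto GPU execution, consistent with the paper's branchless-equalizer design goal.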