Sliding Window Codes: Near-Optimality and Q-Learning for Zero-Delay Coding
Liam Cregg, Fady Alajaji, Serdar Yuksel
TL;DR
This work tackles zero-delay coding of a Markov source over a noisy channel with feedback by recasting the problem as an MDP with a probability-belief state and a quantizer action. It introduces a practical sliding finite window belief MDP that yields near-optimal policies with explicit performance bounds, and an RL algorithm (Q-learning) that provably converges to these near-optimal policies under predictor stability. An alternative belief-quantization scheme is analyzed and compared, with convergence results under invariant start conditions; both approaches provide rigorous guarantees that the learned policies achieve distortion within $\epsilon$ of the optimum for sufficiently large window length or discretization level. Simulations corroborate the theory, showing near-optimal performance and favorable trade-offs against memoryless encoding and Lloyd–Max-type baselines, with implications for average-cost settings as $\beta\to1$.
Abstract
We study the problem of zero-delay coding for the transmission of a Markov source over a noisy channel with feedback and present a reinforcement learning solution which is guaranteed to achieve near-optimality. To this end, we formulate the problem as a Markov decision process (MDP) where the state is a probability-measure valued predictor/belief and the actions are quantizer maps. This MDP formulation has been used to show the optimality of certain classes of encoder policies in prior work, but their computation is prohibitively complex due to the uncountable nature of the constructed state space and the lack of minorization or strong ergodicity results. These challenges invite rigorous reinforcement learning methods, which entail several open questions: can we approximate this MDP with a finite-state one with some performance guarantee? Can we ensure convergence of a reinforcement learning algorithm for this approximate MDP? What regularity assumptions are required for the above to hold? We address these questions as follows: we present an approximation of the belief MDP using a sliding finite window of channel outputs and quantizers. Under an appropriate notion of predictor stability, we show that policies based on this finite window are near-optimal, in the sense that the lowest distortion achievable by such a policy approaches the true lowest distortion as the window length increases. We give sufficient conditions for predictor stability to hold. Finally, we propose a Q-learning algorithm which provably converges to a near-optimal policy and provide a detailed comparison of the sliding finite window scheme with another approximation scheme which quantizes the belief MDP in a nearest neighbor fashion.
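To make the sliding finite window idea concrete, the following is a minimal Python sketch of tabular Q-learning in which the state is the last WINDOW pairs of (quantizer index, channel output). It is an illustration under stated assumptions, not the algorithm as specified in the paper: the four-symbol source, its transition matrix P, the three candidate quantizers, the binary symmetric channel with crossover probability FLIP, the uniformly random exploration, the 1/(visit count) learning rate, and the empirical-mean decoder are all hypothetical choices made only for this example.

```python
import random
from collections import defaultdict

# All numerical values below are illustrative assumptions.
P = [[0.7, 0.1, 0.1, 0.1],   # Markov transition matrix of the source
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1],
     [0.1, 0.1, 0.1, 0.7]]
QUANTIZERS = [(0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0)]  # source symbol -> channel bit
FLIP, BETA, WINDOW = 0.1, 0.9, 2  # channel crossover, discount factor, window length


def step_source(x):
    """Draw the next source symbol from row x of P."""
    return random.choices(range(4), weights=P[x])[0]


def channel(bit):
    """Binary symmetric channel: flip the input bit with probability FLIP."""
    return bit ^ int(random.random() < FLIP)


def q_learning(steps=20000, seed=0):
    """Tabular Q-learning over window states, minimising discounted distortion."""
    random.seed(seed)
    q_table = defaultdict(lambda: [0.0] * len(QUANTIZERS))
    visits = defaultdict(int)
    # Hypothetical decoder: running mean of the source given (window, action, output).
    decoder = defaultdict(lambda: [0.0, 0])
    x = 0
    window = ((0, 0),) * WINDOW  # arbitrary initial window
    for _ in range(steps):
        a = random.randrange(len(QUANTIZERS))  # uniformly random exploration
        y = channel(QUANTIZERS[a][x])          # output, known to encoder via feedback
        stat = decoder[(window, a, y)]
        estimate = stat[0] / stat[1] if stat[1] else 1.5
        cost = (x - estimate) ** 2             # squared-error distortion
        stat[0] += x
        stat[1] += 1
        next_window = window[1:] + ((a, y),)   # slide the window by one step
        visits[(window, a)] += 1
        lr = 1.0 / visits[(window, a)]
        target = cost + BETA * min(q_table[next_window])
        q_table[window][a] += lr * (target - q_table[window][a])
        window = next_window
        x = step_source(x)
    return q_table


def greedy_policy(q_table):
    """Pick, for each window state, the quantizer index with the lowest Q-value."""
    return {w: min(range(len(QUANTIZERS)), key=lambda i: q[i])
            for w, q in q_table.items()}


if __name__ == "__main__":
    policy = greedy_policy(q_learning())
    print(len(policy), "window states visited")
```

The state space here is finite (at most 6^WINDOW window states for three quantizers and two channel outputs), which is what makes a lookup table feasible, whereas the belief state of the original MDP ranges over an uncountable set. Because the decoder in this sketch is learned alongside the Q-table, the per-step cost is not stationary, so the convergence guarantees stated in the abstract should not be assumed to carry over to this toy version.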
