An Algorithm for Computing the Capacity of Symmetrized KL Information for Discrete Channels

Haobo Chen, Gholamali Aminian, Yuheng Bu

TL;DR

The paper tackles the problem of computing the capacity defined by the symmetrized KL information $I_{\text{SKL}}(X;Y)$ for fixed discrete channels, which is challenging due to the non-concavity of Lautum information. It reformulates the problem as a discrete quadratic program $\max_{\boldsymbol{X} \in \Delta^{d-1}} \boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X}$ with $C_{ij} = D(P_{Y|X=x_i} \| P_{Y|X=x_j})$ and introduces the Max-SKL algorithm that symmetrizes $\boldsymbol{C}$ to $\boldsymbol{C}_{\text{sym}}$ and updates $\boldsymbol{X}$ via a multiplicative, simplex-preserving rule, guaranteeing monotone improvement in $I_{\text{SKL}}$. The method is validated on the Binary Symmetric Channel and Binomial Channel, showing excellent agreement with theoretical $C_{\text{SKL}}$ values and revealing how the SKL capacity differs from mutual-information capacity. The framework is extended to Gibbs-channel learning, where $I_{\text{SKL}}(W;S)$ characterizes the worst-case generalization error and the Max-SKL procedure identifies adversarial data inputs that maximize this quantity. Together, these results advance capacity estimation under symmetrized divergences and offer data-dependent insights for learning outcomes, with ongoing work to handle continuous inputs via random-matrix/mean-field methods.
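
The quadratic form above can be checked directly from the definitions. The following is a short derivation sketch, not quoted from the paper; the shorthand $p_i = P_X(x_i)$ and $W_i(y) = P_{Y|X=x_i}(y)$ is introduced here only for illustration, and all $W_i(y)$ are assumed strictly positive so that every divergence is finite.

```latex
\begin{align*}
I_{\text{SKL}}(X;Y)
  &= I(X;Y) + L(X;Y)
   = \sum_{i,y} \bigl(p_i W_i(y) - p_i P_Y(y)\bigr) \log \frac{W_i(y)}{P_Y(y)} \\
  &= \sum_{i} p_i \sum_{y} \bigl(W_i(y) - P_Y(y)\bigr) \log W_i(y)
   && \text{since } \textstyle\sum_i p_i \bigl(W_i(y) - P_Y(y)\bigr) = 0 \\
  &= \sum_{i,j} p_i p_j \sum_{y} \bigl(W_i(y) - W_j(y)\bigr) \log W_i(y)
   && \text{using } P_Y = \textstyle\sum_j p_j W_j,\ \sum_j p_j = 1 \\
  &= \sum_{i,j} p_i p_j \, D(W_i \,\|\, W_j)
   = \boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X}.
\end{align*}
```

The last step relabels $i \leftrightarrow j$ in the cross term $\sum_{i,j} p_i p_j \sum_y W_j(y) \log W_i(y)$. Because $\boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X} = \boldsymbol{X}^T \boldsymbol{C}^T \boldsymbol{X}$, replacing $\boldsymbol{C}$ by $\boldsymbol{C}_{\text{sym}} = (\boldsymbol{C} + \boldsymbol{C}^T)/2$ leaves the objective unchanged.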

Abstract

Symmetrized Kullback-Leibler (KL) information (\(I_{\mathrm{SKL}}\)), which symmetrizes the traditional mutual information by integrating Lautum information, has been shown to be a critical quantity in communication~\cite{aminian2015capacity} and learning theory~\cite{aminian2023information}. This paper considers the problem of computing the capacity in terms of \(I_{\mathrm{SKL}}\) for a fixed discrete channel. This maximization problem is reformulated as a discrete quadratic optimization with a simplex constraint. One major challenge is the non-concavity of Lautum information, which complicates the optimization. Our method symmetrizes the KL divergence matrix and applies iterative updates that never decrease the objective while maintaining a valid probability distribution. We validate our algorithm on Binary Symmetric Channels and Binomial Channels, demonstrating its consistency with theoretical values. Additionally, we explore its application in machine learning through the Gibbs channel, showcasing the effectiveness of our algorithm in finding the worst-case data distributions.
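
To make the procedure described in the abstract concrete, here is a minimal Python sketch. It assumes a replicator-style multiplicative update, $X_i \leftarrow X_i (\boldsymbol{C}_{\text{sym}} \boldsymbol{X})_i / (\boldsymbol{X}^T \boldsymbol{C}_{\text{sym}} \boldsymbol{X})$, which preserves the simplex and is known not to decrease a quadratic form with a symmetric non-negative matrix; the summary does not state the paper's exact update rule, so it may differ. The function names `kl_matrix` and `max_skl_sketch` are hypothetical, and the channel matrix is assumed strictly positive.

```python
import numpy as np

def kl_matrix(channel):
    """C[i, j] = D(P_{Y|X=x_i} || P_{Y|X=x_j}) in nats; row i of `channel` is P_{Y|X=x_i}."""
    P = np.asarray(channel, dtype=float)
    log_P = np.log(P)                       # assumption: all entries > 0, so every KL is finite
    self_term = np.sum(P * log_P, axis=1)   # sum_y P[i, y] log P[i, y]
    cross_term = P @ log_P.T                # [i, j] = sum_y P[i, y] log P[j, y]
    return np.maximum(self_term[:, None] - cross_term, 0.0)  # clip rounding noise below 0

def max_skl_sketch(channel, x0=None, iters=500, tol=1e-12):
    """Assumed replicator-style update on the symmetrized KL matrix (not the paper's code)."""
    C = kl_matrix(channel)
    C_sym = 0.5 * (C + C.T)
    d = C.shape[0]
    x = np.full(d, 1.0 / d) if x0 is None else np.asarray(x0, dtype=float)
    history = [float(x @ C_sym @ x)]        # objective value X^T C_sym X per iterate
    for _ in range(iters):
        value = history[-1]
        if value <= 0.0:                    # objective is zero: the update is undefined
            break
        x = x * (C_sym @ x) / value         # multiplicative step; entries still sum to 1
        history.append(float(x @ C_sym @ x))
        if history[-1] - value < tol:       # stop once the improvement is negligible
            break
    return x, history

# Usage on a binary symmetric channel with crossover probability p.
p = 0.1
bsc = np.array([[1 - p, p], [p, 1 - p]])
x_star, history = max_skl_sketch(bsc, x0=[0.3, 0.7])
closed_form = (0.5 - p) * np.log((1 - p) / p)   # maximum of the quadratic form for the BSC
print(x_star, history[-1], closed_form)
```

For the BSC the iterate reaches the uniform input, and the final objective matches the closed-form maximum of the quadratic form. For channels with zero-probability outputs some KL entries become infinite, so this sketch would need extra handling there.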

Paper Structure (32 sections, 3 theorems, 42 equations, 11 figures, 4 tables, 1 algorithm)

This paper contains 32 sections, 3 theorems, 42 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

For a fixed channel $P_{Y|X}$, Lautum information $L(X;Y)$ is not concave with respect to the input distribution $P_X$.
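
A quick numerical check illustrates the theorem. This sketch is not the paper's proof or counterexample: it picks a BSC with crossover probability $p = 0.01$ (an assumption made here for illustration) and tests the midpoint inequality that concavity would require, using three input distributions close to a point mass. The helper names `lautum_information` and `lautum_at` are hypothetical.

```python
import numpy as np

def lautum_information(p_x, channel):
    """L(X;Y) = D(P_X P_Y || P_XY) in nats, for a strictly positive channel matrix."""
    p_x = np.asarray(p_x, dtype=float)
    P = np.asarray(channel, dtype=float)    # row i is P_{Y|X=x_i}
    p_y = p_x @ P                           # output marginal P_Y
    # For each input x: sum_y P_Y(y) * log(P_Y(y) / P(y|x)) = D(P_Y || P_{Y|X=x})
    inner = np.sum(p_y[None, :] * (np.log(p_y)[None, :] - np.log(P)), axis=1)
    return float(p_x @ inner)

p = 0.01
bsc = np.array([[1 - p, p], [p, 1 - p]])

def lautum_at(a):
    """Lautum information when P_X = (a, 1 - a)."""
    return lautum_information([a, 1.0 - a], bsc)

left, mid, right = lautum_at(0.0), lautum_at(0.01), lautum_at(0.02)
# Concavity would require mid >= (left + right) / 2, i.e. a non-negative gap.
concavity_gap = mid - 0.5 * (left + right)
print(left, mid, right, concavity_gap)
```

The gap is negative here, so the midpoint value falls below the chord and $L(X;Y)$ is not concave in $P_X$ on this segment.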

Figures (11)

  • Figure 1: Comparison of theoretical values and symmetrized KL capacities computed with the Max-SKL algorithm for the BSC. The experiment varies the channel parameter $p$ to validate that our algorithm accurately reproduces the theoretical $I_{\mathrm{SKL}}$.
  • Figure 2: Convergence of Max-SKL and Power Iteration, and comparison with the result of the Blahut-Arimoto algorithm for calculating $I_{\mathrm{SKL}}$. Our algorithm converges successfully on the binomial channel, with the Max-SKL variant that uses the symmetrizing step performing best.
  • Figure 3: Matrix of KL divergences $\boldsymbol{C}_{\text{sym}}$ for $n=10$. The entries at 0.1 and 0.9 are the largest, leading to a distribution concentrated at these points to maximize $\boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X}$.
  • Figure 4: Linearly Separable Data Points (Case 1). The plots show the data points under two distributions: the initial distribution $P_{S_0}$ and the worst-case distribution $P_{S_1}$. In each plot, circles represent Class 1 ($y = 1$) and crosses represent Class -1 ($y = -1$). The initial model correctly fits the data points, but under the worst-case distribution, the fitting plane shifts, leading to misclassification.
  • Figure 5: Linearly Non-separable Data Points (Case 2). The initial data distribution $P_{S_0}$ is represented on the left. Under the worst-case data distribution $P_{S_1}$ on the right, the class labels have shifted.
  • ...and 6 more figures
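
For the BSC case in Figure 1, the maximum of the quadratic form $\boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X}$ from the TL;DR can be worked out by hand. The derivation below is a sketch derived from that quadratic form rather than quoted from the paper; $a = P_X(x_1)$ and $d$ are notation introduced here, with crossover probability $0 < p < 1$.

```latex
\begin{align*}
d &= D\bigl(\mathrm{Bern}(p) \,\|\, \mathrm{Bern}(1-p)\bigr) = (1 - 2p) \log \frac{1-p}{p},
\qquad
\boldsymbol{C} = \begin{pmatrix} 0 & d \\ d & 0 \end{pmatrix}, \\
\boldsymbol{X}^T \boldsymbol{C} \boldsymbol{X} &= 2a(1-a)\,d
\quad \text{for } \boldsymbol{X} = (a,\, 1-a)^T, \\
C_{\text{SKL}} &= \max_{a \in [0,1]} 2a(1-a)\,d = \frac{d}{2}
 = \Bigl(\tfrac{1}{2} - p\Bigr) \log \frac{1-p}{p}
\quad \text{at } a = \tfrac{1}{2}.
\end{align*}
```

For example, $p = 0.1$ gives $0.4 \ln 9 \approx 0.879$ nats, attained by the uniform input.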

Theorems & Definitions (4)

  • Theorem 1
  • Proposition 1
  • Proposition 2: Scheuer1959
  • Remark 1: Comparison with Power iteration algorithm