A remark on conditional entropy

Adam Wang

TL;DR

The paper investigates conditional entropy for sequential data and demonstrates an approximate time-reversal invariance. It derives the key relation $H_p(S)-H_{\tilde{p}}(\tilde{S})=\log(p(\vec{x}_f))-\log(p(\vec{x}_l))\le C$, which implies an $O(1/N)$ convergence of the forward/backward entropy difference. It defines a practical learnability metric $\Delta H = \frac{1}{N}(H_M(S)-H_{\tilde{M}}(\tilde{S}))$ to quantify distributional shift and compare forward-versus-backward training. The note discusses extensions to continuous variables, potential non-sequential datasets, and the role of symmetric training in ensuring equality between forward and backward generators, with empirical validation on Enwik9 subsets.
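
As a rough illustration of the metric $\Delta H$, the sketch below computes it from per-token conditional log-probabilities. The function names and the example numbers are hypothetical and not taken from the paper. It assumes one log-probability (in nats) per token, so that $N$ is the list length, with the forward list coming from a model $M$ on $S$ and the backward list from a model $\tilde{M}$ on the reversed sequence $\tilde{S}$.

```python
import math


def sequence_entropy(log_probs):
    """Total conditional entropy of a sequence in nats: the negative sum of
    the per-token conditional log-probabilities assigned by a model."""
    return -sum(log_probs)


def delta_h(forward_log_probs, backward_log_probs):
    """Delta H = (H_M(S) - H_Mtilde(S~)) / N.

    Assumption: both lists have one entry per token, so N = len(list).
    """
    n = len(forward_log_probs)
    if n == 0 or n != len(backward_log_probs):
        raise ValueError("need two non-empty lists of equal length")
    h_forward = sequence_entropy(forward_log_probs)
    h_backward = sequence_entropy(backward_log_probs)
    return (h_forward - h_backward) / n


# Made-up log-probabilities for a two-token sequence (illustration only).
forward = [math.log(0.5), math.log(0.25)]
backward = [math.log(0.25), math.log(0.25)]
example_delta = delta_h(forward, backward)
```

In this reading, a value near zero means the forward and backward models find the sequence about equally predictable, while a negative value means the forward model assigns it lower entropy.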

Abstract

The following note proves that the conditional entropy of a sequence is almost time-reversal invariant: the forward and backward values differ only by a small constant that depends only on the forward and backward models with respect to which the entropies are calculated. This gives rise to a numerical value that quantifies learnability, as well as a methodology to control for distributional shift between datasets. Rough guidelines are given for practitioners.

Paper Structure

This paper contains 2 sections, 1 theorem, 15 equations.

Key Result

Theorem 1

For a sequential dataset $S$ of length $N$ generated by some process with well-defined conditional and unconditional distributions, $p$, the difference between the forward and backward conditional entropies is given by: $H_p(S)-H_{\tilde{p}}(\tilde{S})=\log(p(\vec{x}_f))-\log(p(\vec{x}_l))\le C$, where $\vec{x}_f, \vec{x}_l$ are the first and last $n$-tuples of $S$ and $C$ is a constant dependent only upon $p$. In other words, the difference in average conditional entropy is $\m
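
The statement of Theorem 1 can be checked numerically in a simple special case. The sketch below is not the paper's experiment: it assumes a hypothetical 3-state first-order Markov chain (so $n = 1$), takes $p$ to be its stationary distribution, uses the exact time-reversed chain as the backward model, and omits the initial-state term from both entropies. Under those assumptions the forward/backward entropy difference telescopes to $\log(p(x_f))-\log(p(x_l))$.

```python
import math
import random

# Hypothetical transition matrix (each row sums to 1); not from the paper.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.1, 0.5]]


def stationary(P, iters=2000):
    """Approximate the stationary distribution by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi


def sample(P, pi, length, rng):
    """Draw a sequence whose first state follows pi and whose steps follow P."""
    states = range(len(P))
    x = rng.choices(states, weights=pi)[0]
    seq = [x]
    for _ in range(length - 1):
        x = rng.choices(states, weights=P[x])[0]
        seq.append(x)
    return seq


def forward_entropy(seq, P):
    """-sum of log p(x_{i+1} | x_i) over the sequence, in nats."""
    return -sum(math.log(P[a][b]) for a, b in zip(seq, seq[1:]))


def backward_entropy(seq, P, pi):
    """-sum of log p~(x_i | x_{i+1}), using the reversed chain
    p~(a | b) = pi[a] * P[a][b] / pi[b]."""
    return -sum(math.log(pi[a] * P[a][b] / pi[b]) for a, b in zip(seq, seq[1:]))


pi = stationary(P)
seq = sample(P, pi, 10_000, random.Random(0))

lhs = forward_entropy(seq, P) - backward_entropy(seq, P, pi)
rhs = math.log(pi[seq[0]]) - math.log(pi[seq[-1]])
bound = max(map(math.log, pi)) - min(map(math.log, pi))  # plays the role of C
delta_h = lhs / len(seq)  # shrinks like 1/N because |lhs| <= bound
```

Here `lhs` and `rhs` agree up to floating-point error, and `bound` depends only on the chain, not on the sequence length, which is why the per-token difference decays as the sequence grows.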

Theorems & Definitions (2)

  • Theorem 1
  • proof