Discrete Markov Bridge
Hengli Li, Yuxuan Wang, Song-Chun Zhu, Ying Nian Wu, Zilong Zheng
TL;DR
The paper introduces the Discrete Markov Bridge (DMB), a variational framework for learning discrete representations that unifies discrete diffusion with latent-variable learning. Learning is decomposed into two parts: forward Matrix-learning, which learns an adaptive, diagonalizable rate-transition matrix $Q_\alpha$, and backward Score-learning, which trains a neural score model to construct the inverse dynamics. Both optimize a continuous-time Evidence Lower Bound (ELBO). The authors provide formal guarantees for the forward process (validity and accessibility), prove convergence of the overall CTDMB algorithm, and give practical strategies for efficient matrix exponentiation and reduced space usage. Empirically, the method achieves an ELBO of 1.38 on Text8 and competitive results on CIFAR-10. Overall, DMB offers a principled approach to discrete representation learning that applies to both text and image data.
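To illustrate why a diagonalizable rate matrix makes matrix exponentiation cheap, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the column convention $p_t = \exp(tQ)\,p_0$, in which each column of the rate matrix sums to zero. The 3-state matrix `Q` and the helper `transition_matrix` are hypothetical names chosen for this example. If $Q = U \Lambda U^{-1}$, then $\exp(tQ) = U \exp(t\Lambda) U^{-1}$, so only the diagonal entries need to be exponentiated.

```python
import numpy as np

def transition_matrix(eigvecs, eigvals, t):
    """Return exp(t * Q) for Q = eigvecs @ diag(eigvals) @ inv(eigvecs)."""
    # Broadcasting scales column j of eigvecs by exp(t * lambda_j),
    # which equals eigvecs @ diag(exp(t * eigvals)).
    scaled = eigvecs * np.exp(t * eigvals)
    return scaled @ np.linalg.inv(eigvecs)

# Hypothetical 3-state rate matrix (assumption: column convention).
# Off-diagonal entries are non-negative and each column sums to zero.
Q = np.array([
    [-0.5,  0.2,  0.1],
    [ 0.3, -0.6,  0.4],
    [ 0.2,  0.4, -0.5],
])

# In a learned setting the factors would be parameterized directly;
# here they are recovered from Q for illustration only.
eigvals, eigvecs = np.linalg.eig(Q)

# .real discards any zero imaginary parts left by the eigensolver.
P = transition_matrix(eigvecs, eigvals, t=0.7).real

# Push a point mass on state 0 forward to time t = 0.7.
p0 = np.array([1.0, 0.0, 0.0])
pt = P @ p0
```

Each column of `P` is a probability distribution over states, so `pt` remains a valid distribution. With the eigendecomposition held fixed, evaluating the transition matrix at a new time `t` costs one elementwise exponential and one matrix product, rather than a full matrix exponential.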
Abstract
Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.
