Discrete Adjoint Matching

Oswin So, Brian Karrer, Chuchu Fan, Ricky T. Q. Chen, Guan-Horng Liu

TL;DR

This work extends entropy-regularized reward optimization to discrete generative models by introducing Discrete Adjoint Matching (DAM), a discrete analogue of Adjoint Matching for Continuous-Time Markov Chains (CTMCs). DAM derives a discrete adjoint estimator using an approach based on Dynkin's formula, yielding a fixed-point characterization of the optimal rate $u^\star_t(y,x)$ and a practical matching objective that aligns parameterized rates with the optimum. It further provides tractable implementations, including an analytic discrete adjoint, generalized KL matching, and sampling schemes, plus adaptations to masked diffusion models to handle massive discrete spaces. Empirically, DAM improves convergence to the optimal discrete distribution on synthetic CTMC tasks and outperforms baselines such as D1 on mathematical reasoning benchmarks (GSM8K, MATH500, Countdown, Sudoku), which suggests it is suited to fine-tuning discrete diffusion models.

Abstract

Computational methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of the discrete adjoint -- an estimator of the optimal solution to the original problem but formulated on discrete domains -- from which standard matching frameworks can be applied. This is derived from a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM's effectiveness on synthetic and mathematical reasoning tasks.
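
To make "entropy-regularized reward optimization" concrete, the sketch below works through a three-state example. It relies on the standard fact that maximizing $\mathbb{E}_p[r] - \beta\, D_{\mathrm{KL}}(p \,\|\, p^{\text{base}})$ over a finite state space gives $p^\star(x) \propto p^{\text{base}}(x)\exp(r(x)/\beta)$. The temperature `beta`, the function names, and the toy numbers are assumptions made for illustration; this is not the paper's code and it says nothing about how DAM itself is trained.

```python
import math

def tilted_distribution(p_base, reward, beta=1.0):
    """Return p*(x) proportional to p_base(x) * exp(reward(x) / beta).

    Assumes every entry of p_base is strictly positive.
    """
    logits = [math.log(p) + r / beta for p, r in zip(p_base, reward)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for two distributions on the same finite state space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def objective(p, p_base, reward, beta=1.0):
    """Entropy-regularized objective: E_p[r] - beta * KL(p || p_base)."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, reward))
    return expected_reward - beta * kl_divergence(p, p_base)

# Hypothetical toy problem: three states, rewards favouring the last state.
p_base = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
p_star = tilted_distribution(p_base, reward)

print("p_star:", [round(p, 4) for p in p_star])
print("objective at p_base:", round(objective(p_base, p_base, reward), 4))
print("objective at p_star:", round(objective(p_star, p_base, reward), 4))
```

In this toy setting the objective at `p_star` equals $\beta \log \sum_x p^{\text{base}}(x)\exp(r(x)/\beta)$, which is a convenient check when comparing a fine-tuned model against a known optimum.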

Paper Structure (35 sections, 23 theorems, 140 equations, 5 figures, 6 tables, 1 algorithm)

Key Result

Lemma 2.0

For a given function $f_t(x)$ and a CTMC model $p^u$, it holds that
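
For orientation, the textbook form of Dynkin's formula for a CTMC on a finite state space is given below. The notation is an assumption and may differ from the paper's own statement: $u_\tau(y,x)$ is taken to be the rate of jumping from $x$ to $y$, $s \le t$ are two times, $f_\tau(x)$ is differentiable in $\tau$, and the expectation is over paths $X \sim p^u$.

$$
\mathbb{E}\big[f_t(X_t)\big] - \mathbb{E}\big[f_s(X_s)\big]
= \mathbb{E}\left[\int_s^t \Big(\partial_\tau f_\tau(X_\tau) + \sum_{y \neq X_\tau} u_\tau(y, X_\tau)\,\big(f_\tau(y) - f_\tau(X_\tau)\big)\Big)\,\mathrm{d}\tau\right]
$$

The sum inside the integral is the CTMC generator applied to $f_\tau$, so the formula expresses the change in the expected value of $f$ as the accumulated effect of its explicit time dependence and of the jumps.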

Figures (5)

  • Figure 1: Synthetic Examples. We compare the empirical distributions of $X_1$ generated by the base model $p^{\text{base}}_1$, the ground-truth optimal model $p^\star_1$, and four methods, including an ablation of DAM trained with the discrete adjoint in \ref{eq:dam2} instead of \ref{eq:dam3}. DAM visually aligns most closely with $p^\star_1$.
  • Figure 2: Convergence to Optimal $p^\star$ on Pinwheel. Convergence of $D_{\mathrm{KL}}(p^\star_t \,\|\, p^u_t)$ at each jump described in \ref{eq:toy-jump}, where DAM exhibits stable convergence compared to other methods (left and middle). Our improved discrete adjoint in \ref{eq:dam3} exhibits lower bias and variance compared to \ref{eq:dam2} (right).
  • Figure 3: Reward Curves on GSM8K and Countdown. DAM is more effective than D1 in maximizing the reward $r$ in \ref{eq:sub-reward}.
  • Figure 4: Accuracy vs. Wall-Clock Time on Sudoku. We compare D1 and DAM, with wall-clock time in hours on the x-axis.
  • Figure 5: Varying $K$ on Synthetic Examples. We investigate the effect of varying the number of samples $K$ on the synthetic examples. Larger values of $K$ result in faster convergence, owing to the lower bias and variance of the importance-weighted discrete adjoint in \ref{eq:dam3}.

Theorems & Definitions (39)

  • Lemma 2.0: Dynkin's formula
  • Theorem 2.1: Discrete adjoint -- adjoint system for CTMC
  • Proposition 2.1: Analytic discrete adjoint
  • Proposition 2.1: Importance-weighted discrete adjoint
  • Proposition 2.1: Masked optimal rate
  • Lemma 3.0: Fixed-point equation of $u^\star$
  • Theorem 3.1: Discrete basic adjoint matching
  • Corollary 3.1: Discrete adjoint
  • Lemma A.1
  • proof
  • ...and 29 more