Discrete Diffusion Models

Diffusion language models for images, languages, and general state spaces

18 July 2025

Diffusion models have gained notable attention in image generation for continuous state spaces. However, their application to discrete state spaces, such as texts, remains limited. To tackle this issue, D3PM (Austin et al., 2021) propose to model $\mathrm{m}$-dim tokens as follows:

\[\begin{align} \mathrm{\mathbf{p}(x_t|x_{t-1})=\text{Cat}(x_t; p=x_{t-1} \mathbf{T}_t)},\label{D3PM} \end{align}\]

where $\mathrm{x_t}$ is a $\mathrm{(m+1)}$-dim one-hot row vector, augmented with an additional mask state $\mathrm{m}$; $\mathrm{[\mathbf{T}_t]_{i,j}}$ denotes the transition probability $\mathrm{p(x_t=j\\|x_{t-1}=i)}$.

Given an absorbing matrix s.t. $\mathbf{T}_t=(1-\beta_t) \mathbf{I} + \beta_t \mathbf{e_m\cdot 1^\intercal}$, iterating the transitions (Shi et al., 2024)

\[\begin{align} &\mathrm{\mathbf{p}(x_t|x_0)=\text{Cat}(x_t; \mathbf{p}=x_{t-1} \overline{\mathbf{T}}_t), \quad \overline{\mathbf{T}}_t=\mathbf{T}_1 \mathbf{T}_2 \dots \mathbf{T}_t=\alpha_t \mathbf{I} + (1-\alpha_t) \mathbf{e_m\cdot 1^\intercal}},\notag\\ \end{align}\]

where $\alpha_t=\Pi_{i=1}^t (1-\beta_i)$, $\mathrm{\mathbf{e_m}}$ is the one-hot encoding of the [MASK] token at index $\mathrm{m}$ under zero-based indexing and $\mathrm{\mathbf{1}=\{1\}^m}$. Conduct the reverse transition (Shi et al., 2024) via Bayes rule

\[\begin{align} &\mathrm{\mathbf{p}(x_{t-1}|x_t, x_0)=\dfrac{\mathbf{p}(x_t|x_{t-1}) \mathbf{p}(x_{t-1}|x_0)}{\mathbf{p}(x_{t}|x_0)}=\text{Cat}(x_t; p=\dfrac{x_t \mathbf{T}_t^\intercal \odot x_0 \overline{\mathbf{T}}_{t-1}}{x_0 \overline{\mathbf{T}}_t x_t^\intercal}) },\notag\\ &\mathrm{\mathbf{p}(x_s \mid x_t, x_0)} \mathrm{= \frac{\mathbf{p}(x_t \mid x_s) \mathbf{p}(x_s \mid x_0)}{\mathbf{p}(x_t \mid x_0)} =\mathrm{Cat(x_s; \bar{R}^{x_0}(t, s)^\top x_t)}= \Bigg\{ \begin{array}{ll} \mathrm{\frac{\alpha_s - \alpha_t}{1 - \alpha_t} \, x_s^\top x_0} & \small{\mathrm{x_s \ne m,\, x_t = m}} \label{reverse_transition} \\ \mathrm{\frac{1 - \alpha_s}{1 - \alpha_t}} & \small{\mathrm{x_s = m,\, x_t = m}} \\ \mathrm{x_s^\top x_t} & \small{\mathrm{x_t \ne m}}, \end{array}} \end{align}\]

where $\mathrm{\bar{R}^{x_0}(t, s) = \mathbf{I} + \frac{\alpha_s - \alpha_t}{1 - \alpha_t} \mathbf{e_m} (x_0 - \mathbf{e_m})^\top}$, implying $\mathrm{p(x_s=x_0\\|x_t=m)=\frac{\alpha_s - \alpha_t}{1 - \alpha_t}}$.

Connections to BERT and Autoregressive Models

Consider a one-step transition that mixes the uniform transition and an absorbing state $\mathbf{T}=(1-\alpha-\beta) \mathbf{I} +\alpha \mathbf{1\cdot 1^\intercal} + \beta \mathbf{e_m\cdot 1^\intercal}$. For example, BERT (Devlin et al., 2018) models $\mathrm{\mathbf{p}(x_1 \mid x_0)}$ by replacing 10% of tokens with [MASK] and 5% uniformly at random. It also differs from BERT in that a randomized masking ratio is adopted to further enhance the performance.

Learning Backward Transitions

Next, we approximate $\mathrm{x_0}$ via a parametrized model $\mathrm{\mu_\theta(x_t, t)}$, defined as:

\[\begin{align} \mathrm{\mu_\theta(x_t, t) = \begin{cases} \mathrm{softmax}(f_\theta(x_t, t)) & \text{if } \mathrm{x_t = m}, \\ \mathrm{x_t} & \text{otherwise}. \end{cases}}\label{param} \end{align}\]

The reverse transition from $\mathrm{t}$ to $\mathrm{s}$ can be approximated via Eq.\eqref{reverse_transition} and \eqref{param}

\[\begin{equation} \begin{aligned} \mathrm{\mathrm{KL}(\mathbf{p}(x_s \mid x_t, x_0) \parallel \mathbf{p}(x_s \mid x_t, \mu_\theta(x_t, t)))} &= \begin{cases} \mathrm{\sum_{x_s=0}^m \mathbf{p}(x_s \mid x_t, x_0) \log \frac{\mathbf{p}(x_s \mid x_t, x_0)}{\mathbf{p}(x_s \mid x_t, \mu_\theta(x_t, t))}} & \mathrm{x_t = m} \\ 0 & \mathrm{x_t \ne m} \end{cases} \\ &= \mathrm{\sum_{k \ne m} \frac{\alpha_s - \alpha_t}{1 - \alpha_t} \mathbf{1}_{x_t = m} x_0^\top e_k \log \frac{x_0^\top e_k}{\mu_\theta(x_t, t)^\top e_k}} \\ &= \mathrm{- \frac{\alpha_s - \alpha_t}{1 - \alpha_t} \mathbf{1}_{x_t = m} x_0^\top \log \mu_\theta(x_t, t)}, \end{aligned} \notag \end{equation}\]

which forms a cross-entropy loss between the predicted logits and the clean target.

Integrating from $\delta_0\rightarrow 0$ to 1, we have the lower bound of the log marginal likelihood (ELBO)

\[\begin{align} &\mathrm{ELBO= \int_{\delta_0}^1 \frac{\alpha_t'}{1 - \alpha_t} \, \mathbb{E}_{\mathbf{p}(x_t \mid x_0)} \left[ \mathbf{1}_{x_t = m} \cdot x_0^\top \log \mu_\theta(x_t, t) \right] \, dt.} \notag \\ % & \mathrm{\qquad\quad\supset\int_{\delta_0}^1 \frac{1}{t} \, \mathbb{E}_{\mathbf{p}(x_t \mid x_0)} \left[ \mathbf{1}_{x_t = m} \cdot x_0^\top \log \mu_\theta(x_t, t) \right] \, dt.} \notag \\ \end{align}\]

where $\mathrm{\alpha_t’}$ is the derivative of $\mathrm{\alpha_t}$. Replacing $\mathrm{\mathbf{1}_{x_t = m} \cdot x_0^\top \log \mu_\theta(x_t, t)}$ with $\mathrm{\log\langle \mu_\theta(x_t, t), x \rangle }$ recovers the loss in (Sahoo et al., 2024).

Rewrite the single token $\mathrm{x_t}$ to a series of tokens $\mathrm{x_t:=(x_t^{(1)}, x_t^{(2)}, \cdots, x_t^{(n)})}$ with a linear $\mathrm{\alpha_t=1-t}$:

\[\begin{align} &\mathrm{ELBO=\int_{\delta_0}^t \frac{1}{t} \, \mathbb{E}_{\mathbf{p}(x_t \mid x_0)} \left[ \sum_{n} \mathbf{1}_{x_t^{(n)} = m} \cdot (x_0^{(n)})^\top \log \mu^{(n)}_\theta(x_t, t) \right] \, dt,}\label{shi_loss} \end{align}\]

where $\mathrm{\mathbf{p}(x_t \mid x_0)=\Pi_{i=1}^n \mathbf{p}(x_t^{(i)} \mid x_0^{(i)})}$ and $\mathrm{\mu^{(n)}_\theta(x_t, t)}$ is the $n$-th output for the prediction of $\mathrm{E[x_0^{(n)}\\|x_t]}$.

Continuous-Time Markov Chains (CTMC)

For the continuous-time limit, Discrete Diffusion Models (Lou et al., 2024) describe the evolution of $\mathrm{\mathbf{p}_t}$:

\[\begin{align} \mathrm{\frac{d \mathbf{p}_t}{dt}=\mathbf{Q}_t \mathbf{p}_t, \ \ \mathbf{p}_0\sim \mathbf{p}_{\text{data}},\ \ \mathbf{p}_T\approx \mathbf{p}_{\text{base}}},\notag \end{align}\] \[\begin{align}\mathrm{\mathbf{Q}_t = \lim_{\Delta t\rightarrow 0}\frac{\mathbf{T}_t-I}{\Delta t} \in \mathbb{R}^{(m+1) \times (m+1)}}\notag\end{align}\]

is a rate matrix (generator) that governs the frequency and destination of state jumps or transitions (Gat et al., 2024), with each column summing to zero.

For scalability, a common choice is $\mathrm{\mathbf{Q}_t = \sigma_t \mathbf{Q}^{\text{absorb}}}$, where the design of $\sigma_t$ is detailed in (Ou et al., 2025), and the absorbing-type matrix $\mathrm{\mathbf{Q}^{\text{absorb}}}$ has demonstrated superior empirical performance compared to uniform alternatives (Austin et al., 2021).

\[\begin{align} \mathrm{\mathbf{Q}^{\text{absorb}} = } \begin{bmatrix} -1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \notag \\ 0 & 0 & \cdots & -1 & 0 \\ 1 & 1 & \cdots & 1 & 0 \end{bmatrix}=\mathbf{e_m\cdot 1^\intercal - I}, \end{align}\]

where the bottom-right 0 denotes the absorbing [MASK] token, which remains unchanged once reached, and the –1 diagonals set transition rates from non-mask tokens.

In practice, we can implement the process via Euler steps, which approximates Eq.\eqref{D3PM} as follows

\[\begin{align} \mathrm{\mathbf{p}(x_{t+\Delta t}=y|x_t=x)=\mathbf{1}_{x=y}+\mathbf{Q}_t(y, x)\Delta t+o(\Delta t)}\notag \end{align}\]

Invoking the Bayes rule in Lemma A (Shi et al., 2024), the reverse-time process follows that

\[\begin{align} &\mathrm{\frac{d\mathbf{p}_{T-t}}{dt}=\mathbf{R}_{T-t} \mathbf{p}_{T-t}},\notag \\ \end{align}\]

where $\mathrm{\mathbf{R}_t(y, x)=\frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\mathbf{Q}_t(x, y)}$, $\mathrm{\mathbf{R}_t (x, x)=-\sum_{y \neq x} \mathbf{R}_t(y, x)}$. $\frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}$ is referred to as the concrete score (Meng et al., 2022), which generalizes the continuous score $\mathrm{\nabla \log p_t}$ in discrete space. The score can be further simplfied by utilizing the absorbing structures (Shi et al., 2024).

Lemma A The reverse-time transition rate matrix satisfy $\mathrm{\mathbf{R}_t(y, x) = \mathbf{Q}_t(x, y) \frac{\mathbf{p}_{t}(y)}{\mathbf{p}_{t}(x)} \quad \text{for } x \neq y.}$

Proof Reversing the time from time $\mathrm{t+\Delta t}$ to $\mathrm{t}$, Bayes’ rule yields for $\mathrm{x \neq y}$:

\[\begin{align} \mathrm{\mathbf{p}(x_t = y \mid x_{t+\Delta t} = x)} &\mathrm{= \frac{\mathbf{p}_t(y) \mathbf{p}(x_{t+\Delta t} = x \mid x_t = y)}{\mathbf{p}_{t+\Delta t}(x)}} \notag \\ &\mathrm{= \frac{\mathbf{p}_t(y) \big(\mathbf{1}_{x=y} + \mathbf{Q}_t(x, y) \Delta t + o(\Delta t)\big)}{\mathbf{p}_{t+\Delta t}(x)}} \notag \\ &\mathrm{\approx \mathbf{1}_{x=y} + \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)} \mathbf{Q}_t(x, y) \Delta t} \quad \text{ for small $\mathrm{\Delta t}$}.\notag \\ \end{align}\]

On the Dimensionality of Sequences

For a length-$n$ sequence $\mathrm{\mathbf{x}=x^1 x^2 \cdots x^n}$, modeling transitions in $\mathrm{\mathbb{R}^{(m+1)^n \times (m+1)^n}}$ is computationally infeasible. To mitigate this, we assume position-wise independence (Lou et al., 2024) and employ a shared transition matrix $\mathrm{\mathbf{Q}_t \in \mathbb{R}^{(m+1) \times (m+1)}}$.

Setting zero values in $\mathrm{\mathbf{Q}_t}$ for all sequences with a Hamming distance larger than 1, it inspires to model the concrete score network between sequences $\mathrm{\widehat{\mathbf{x}}_t=x_t^1 \cdots \widehat{x}_t^i \cdots x_t^n}$ and $\mathrm{\mathbf{x}_t=x_t^1 \cdots x_t^i \cdots x_t^n}$ as $\mathrm{\mathbf{s}_\theta(\cdot, t) : \{0, 1, 2, \ldots, m\}^n \rightarrow \mathbb{R}^{n \times (m+1)}}$:

\[\begin{align} \mathrm{\mathbf{s}_\theta(\mathbf{x}_t, t)_{\widehat{\mathbf{x}}_t} = \mathbf{s}_\theta(x_t^1 \cdots x_t^i \cdots x_t^n, t)[i, \widehat{x}_t^i]\approx \dfrac{\mathbf{p}_t(x_t^1 \cdots \widehat{x}_t^i \cdots x_t^n)}{\mathbf{p}_t(x_t^1 \cdots x_t^i \cdots x_t^n)}}. \notag \end{align}\]

Given the approximation of the concrete score, the reverse-time transition rate matrix follows

\[\begin{align} \mathrm{\mathbf{R}_t(x_t^1 \cdots x_t^i \cdots x_t^n, x_t^1 \cdots \widehat{x}_t^i \cdots x_t^n)}&=\mathrm{\mathbf{Q}_t(\widehat{x}_t^i, x_t^i)\frac{\mathbf{p}_t(x_t^1 \cdots \widehat{x}_t^i \cdots x_t^n)}{\mathbf{p}_t(x_t^1 \cdots x_t^i \cdots x_t^n)}} \notag \\ &\approx \mathrm{\mathbf{Q}_t(\widehat{x}_t^i, x_t^i) \mathbf{s}_\theta(x_t^1 \cdots x_t^i \cdots x_t^n, t)[i, \widehat{x}_t^i]}\notag. \end{align}\]

Moreover, we observe that $\mathrm{\mathbf{Q}_t(\widehat{x}_t^i, x_t^i)=\sigma_t \mathbf{Q}^{absorb}(\widehat{x}_t^i, x_t^i)=0}$ for any unmasked states $x_t^i\notin [\mathbf{M}]$ and $\mathrm{\widehat{x}_t^i \neq x_t^i}$. As such, the concrete score parameterization can be significantly simplified by focusing only on $\mathrm{x_t^i= [\mathbf{M}]}$ and $\mathrm{\widehat{x}_t^i \neq [\mathbf{M}]}$. The reparameterized score can be interpreted as a time-independent conditional probability on clean data (Ou et al., 2025).

Another Simplified Training Loss

A straightforward way to optimize the concrete score function is to use L2 loss (Meng et al., 2022):

\[\begin{align} \mathrm{\frac{1}{2} \mathbb{E}_{\mathbf{x}_t \sim \mathbf{p}_t} \left[ \sum_{\mathbf{x}_t \ne {\mathbf{y}}_t} \left( \mathbf{s}_\theta(\mathbf{x}_t, t)_{\mathbf{y}_t} - \frac{\mathbf{p}_t({\mathbf{y}}_t)}{\mathbf{p}_t({\mathbf{x}}_t)} \right)^2 \right]}.\notag \end{align}\]

However, such a loss does not guarantee positivity of the concrete score and neglects the underlying probabilistic structure. To resolve this issue, (Lou et al., 2024) proposes the Bregman divergence $\mathrm{D_F}(\mathbf{s}_\theta, \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)})$ with $\mathrm{F(x)=\mathrm{-\log(x)}}$ as follows

\[\begin{align} \mathrm{D_F\bigg(\mathbf{s}_\theta, \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\bigg)}&=\mathrm{-\log\mathbf{s}_\theta+\log\bigg(\frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\bigg)+\frac{\mathbf{p}_t(x)}{\mathbf{p}_t(y)}\mathbf{s}_\theta-1} \notag \\ &=\mathrm{\mathbb{E}_{\mathbf{x}_t\sim \mathbf{p}_t}\bigg[\frac{1}{\mathbf{p}_t(y)} \bigg(\mathbf{s}_\theta - \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\log\mathbf{s}_\theta+K\bigg(\frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\bigg)\bigg)\bigg]},\notag \\ &\approx \mathrm{\mathbb{E}_{\mathbf{x}_t\sim \mathbf{p}_t}\bigg[\sum_{y\neq x} w_{xy} \bigg(\mathbf{s}_\theta - \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\log\mathbf{s}_\theta+K\bigg(\frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\bigg)\bigg)\bigg]},\label{absorb_loss} \\ \end{align}\]

where $\mathrm{K(a)=a(\log a -1)}$ is a normalizing constant such that $\mathrm{D_F\big(\mathbf{s}_\theta, \frac{\mathbf{p}_t(y)}{\mathbf{p}_t(x)}\big)}\geq 0$ and $\mathrm{w_{xy} \geq 0}$.

The above loss \eqref{absorb_loss} was proposed by SEDD (Lou et al., 2024), which substantially improves the training of discrete diffusion models—achieving GPT-2-level performance but at significantly higher computational cost. To improve scalability, (Ou et al., 2025) draws insightful connections between \eqref{absorb_loss} and the any-order autoregressive loss in \eqref{AO_loss} (Nie et al., 2025) (Uria et al., 2014)(Hoogeboom et al., 2022)(Shih et al., 2022).

\[\begin{align} \mathrm{\mathcal{L}(\theta) \triangleq - \mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i=1}^L \mathbf{1}[x_t^{(i)} = \text{m}] \log p_\theta(x_0^{(i)} \mid x_t) \right]} \label{AO_loss}. \end{align}\]

This simplified loss \eqref{AO_loss} is in a spirit similar to the loss \eqref{shi_loss} (Shi et al., 2024) (Sahoo et al., 2024) and forms the core training objective in (Nie et al., 2025), enabling scalability comparable to that of large-scale language models such as LLaMA3 and other multimodal applications (Rojas et al., 2025).

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State‑Spaces. NeurIPS’21.
Shi, J., Han, K., Wang, Z., Doucet, A., & Titsias, M. K. (2024). Simplified and Generalized Masked Diffusion for Discrete Data. NeurIPS’24.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805
Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., & Kuleshov, V. (2024). Simple and Effective Masked Diffusion Language Models. Advances in Neural Information Processing Systems, 37.
Lou, A., Meng, C., & Ermon, S. (2024). Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. ICML’24.
Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T. Q., Synnaeve, G., Adi, Y., & Lipman, Y. (2024). Discrete Flow Matching. NeurIPS’24.
Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., & Li, C. (2025). Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data. ICLR’25.
Meng, C., Choi, K., Song, J., & Ermon, S. (2022). Concrete Score Matching: Generalized Score Matching for Discrete Data. NeurIPS’22.
Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J. R., & Li, C. (2025). Large Language Diffusion Models. ArXiv:2502.09992.
Uria, B., Murray, I., & Larochelle, H. (2014). A Deep and Tractable Density Estimator. ICML’14.
Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., van den Berg, R., & Salimans, T. (2022). Autoregressive Diffusion Models. ICLR’22.
Shih, A., Sadigh, D., & Ermon, S. (2022). Training and Inference on Any-Order Autoregressive Models the Right Way. ICML’22.
Rojas, K., Zhu, Y., Zhu, S., Ye, F. X.-F., & Tao, M. (2025). Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces. ICML’25.