Policy Gradients and Stochastic Control
Thoughts on diffusion model alignment
Stochastic Control
Consider a diffusion process described by
\[\begin{align}\label{uncontrolled_process} \mathrm{d X_t=b(X_t, t)dt+dW_t, \ \ t\in[0, 1]; \ X_0=x_0.} \end{align}\]This process induces a path measure $\mathbb{P}$.
Now introduce a controlled diffusion process governed by
\[\begin{align}\label{controlled_process} \mathrm{d X_t^{u}=\big[b(X_t^u, t)+u(X_t^u, t)\big]dt+dW_t, \ \ t\in[0, 1]; \ X_0^u=x_0.} \end{align}\]This controlled process generates a different path measure $\mathbb{P}^u$ under the control $u$.
We also define the cost-to-go function
\[\begin{align}\label{def_J} \mathrm{J^u(x, t):=E\bigg[\int_t^1 \frac{\alpha}{2}\|u(X_s^u, s)\|_2^2 ds - r(X_1^u)\bigg|X_t^u = x\bigg]},\notag \end{align}\]where $r(\cdot)$ is a terminal reward function.
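To make the setup concrete, here is a minimal Euler–Maruyama sketch that simulates the controlled process \eqref{controlled_process} and estimates $\mathrm{J^u(x_0, 0)}$ by Monte Carlo. The drift `b`, control `u`, reward `r`, and all constants are illustrative placeholders, not choices from the literature.

```python
# A minimal Euler-Maruyama sketch: simulate the controlled process and
# estimate J^u(x0, 0) by Monte Carlo. All choices below are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, n_paths, n_steps, alpha = 2, 10_000, 100, 0.1
dt = 1.0 / n_steps

b = lambda x, t: -x                          # placeholder uncontrolled drift
u = lambda x, t: 0.5 * np.ones_like(x)       # placeholder control
r = lambda x: -np.sum(x**2, axis=-1)         # placeholder terminal reward

x = np.zeros((n_paths, d))                   # X_0 = x0 = 0
running_cost = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    ut = u(x, t)
    running_cost += 0.5 * alpha * np.sum(ut**2, axis=-1) * dt
    x += (b(x, t) + ut) * dt + np.sqrt(dt) * rng.standard_normal((n_paths, d))

print("J^u(x0, 0) ≈", np.mean(running_cost - r(x)))
```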
Define the value function
\[\begin{align}\label{def_value_func} \mathrm{v(x, t):=\inf_{u\in\mathcal{U}} J^u(x, t)}.\notag \end{align}\]
Bellman Equation
The dynamic programming principle yields the Bellman equation (Tzen & Raginsky, 2019):
\[\begin{align}\label{HJB} \mathrm{\partial_t v +\nabla v^\intercal b_t+\frac{1}{2} \Delta v =-\inf_{u\in\mathcal{U}} \bigg[\frac{\alpha}{2}\|u(x, t)\|_2^2 + \nabla v^\intercal u_t\bigg]}\notag. \end{align}\]The infimum is attained at the optimal control $\mathrm{u^*=-\frac{1}{\alpha}\nabla v}$; substituting it back yields:
\[\begin{align} \mathrm{\partial_t v +\nabla v^\intercal b_t -\frac{1}{2\alpha}\|\nabla v\|^2_2 +\frac{1}{2} \Delta v=0}\notag. \end{align}\]Consider the Cole-Hopf transformation:
\[\begin{align} \mathrm{\phi(x, t):=\exp\bigg(\frac{-v(x, t)}{\alpha}\bigg) \Leftrightarrow v(x, t)=-\alpha\log \phi(x, t)}\notag. \end{align}\]Applying this change of variables yields the linear backward Kolmogorov equation
\[\begin{align} \mathrm{\partial_t \phi + \nabla \phi^\intercal b_t +\frac{1}{2} \Delta \phi=0}\notag, \end{align}\]where $\mathrm{\phi(x, 1)=\exp\big(\frac{-v(x, 1)}{\alpha}\big)=\exp\big(\frac{r(x)}{\alpha}\big)}$.
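To verify the linearity, substitute $\mathrm{v=-\alpha\log \phi}$ into the HJB equation above, using
\[\begin{align} \mathrm{\partial_t v=-\alpha\frac{\partial_t \phi}{\phi}, \quad \nabla v=-\alpha\frac{\nabla \phi}{\phi}, \quad \Delta v=-\alpha\frac{\Delta \phi}{\phi}+\alpha\frac{\|\nabla \phi\|_2^2}{\phi^2}}\notag. \end{align}\]The quadratic term $\mathrm{-\frac{1}{2\alpha}\|\nabla v\|_2^2=-\frac{\alpha\|\nabla \phi\|_2^2}{2\phi^2}}$ exactly cancels the $\mathrm{\frac{\alpha\|\nabla \phi\|_2^2}{2\phi^2}}$ contribution from $\mathrm{\frac{1}{2}\Delta v}$; multiplying what remains by $\mathrm{-\phi/\alpha}$ recovers the linear equation.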
By the Feynman–Kac theorem (Zhang & Chen, 2022), the value function admits the representation
\[\begin{align} \mathrm{v(x, t)=-\alpha\log E\bigg[\exp\bigg(\frac{r(X_1)}{\alpha}\bigg)\bigg|X_t=x\bigg]}\notag, \end{align}\]where $\mathrm{X_t}$ evolves according to the uncontrolled process \eqref{uncontrolled_process}.
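This representation admits a direct Monte Carlo estimator: roll out the uncontrolled process \eqref{uncontrolled_process} from $(x, t)$ and average the exponentially tilted terminal rewards. A sketch reusing the placeholder `b`, `r`, `alpha`, and `rng` from the earlier snippet:

```python
# Monte Carlo estimate of v(x, t) = -alpha * log E[exp(r(X_1)/alpha) | X_t = x]
# along the *uncontrolled* dynamics, with a log-mean-exp for numerical stability.
def value_mc(x0, t0, n_paths=10_000, n_steps=100):
    dt = (1.0 - t0) / n_steps
    x = np.tile(x0, (n_paths, 1))
    for k in range(n_steps):
        t = t0 + k * dt
        x += b(x, t) * dt + np.sqrt(dt) * rng.standard_normal(x.shape)
    z = r(x) / alpha
    return -alpha * (z.max() + np.log(np.mean(np.exp(z - z.max()))))

print(value_mc(np.zeros(2), 0.0))
```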
Objective Reformulation
Comparing the path measures $\mathbb{P}^u_{t,x}$ and $\mathbb{P}_{t,x}$ (the laws of the controlled and uncontrolled processes started from $x$ at time $t$), the Girsanov theorem gives the Radon–Nikodym derivative, whose expected logarithm under $\mathbb{P}^u_{t,x}$ is the KL divergence:
\[\begin{align} \mathrm{KL}\left( \mathbb{P}^u_{t,x} \,\|\, \mathbb{P}_{t,x} \right) &= \mathbb{E}_{\mathbb{P}^u_{t,x}} \left[ \log \left( \frac{\mathrm{d} \mathbb{P}^u_{t,x}}{\mathrm{d} \mathbb{P}_{t,x}} \right) \right] \notag \\ &= \mathrm{\mathbb{E}_{\mathbb{P}^u_{t,x}} \left[\int_t^1 \frac{1}{2}\| u(X_s, s) \|_2^2 \, \mathrm{d}s \right]}\notag. \end{align}\]With this, the cost-to-go function can be reformulated as (Domingo-Enrich et al., 2025):
\[\begin{align}\label{def_J_2} \mathrm{J^u(x, t)=\alpha\,\mathrm{KL}\left( \mathbb{P}^u_{t,x} \,\|\, \mathbb{P}_{t,x} \right) + \mathbb{E}\big[- r(X_1^u)\big|X_t^u = x\big]}, \end{align}\]where $\mathrm{X^u_t}$ is simulated from the controlled process \eqref{controlled_process}.
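The reformulation is easy to check numerically: accumulate the discretized KL increment $\mathrm{\frac{1}{2}\|u\|_2^2\,dt}$ along controlled paths and combine it with the terminal reward. Reusing the placeholders above, this should match the earlier running-cost estimate up to Monte Carlo error:

```python
# Estimate J^u(x0, 0) = alpha * KL(P^u || P) - E[r(X_1^u)] via the Girsanov identity:
# the KL term is the expected accumulated 0.5 * ||u||^2 * dt along controlled paths.
x = np.zeros((n_paths, d))
kl_term = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    ut = u(x, t)
    kl_term += 0.5 * np.sum(ut**2, axis=-1) * dt
    x += (b(x, t) + ut) * dt + np.sqrt(dt) * rng.standard_normal((n_paths, d))

print("J^u via KL ≈", alpha * np.mean(kl_term) - np.mean(r(x)))
```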
Connections to Alignment in Diffusion Models
Consider a discretization of the diffusion path measures such that \(\mathrm{\mathbb{P}^u_{0,x}\approx \prod_{t=T}^{1} p^{\theta}_{t-1}(x_{t-1}|x_t)}\) and \(\mathrm{\mathbb{P}_{0,x}\approx \prod_{t=T}^{1} p_{t-1}(x_{t-1}|x_t)}\). For consistency with the backward time indexing used in diffusion models, we also reverse the time axis.
The minimization of $\mathrm{J^u(x_0, 0)}$ in \eqref{def_J_2} is then equivalent to
\[\begin{align}\label{RL_objective} \mathrm{argmax_{\{p^{\theta}_t\}_{t=T}^0} E_{\{p^{\theta}_t\}_{t=T}^0}\bigg[r(x_0)-\alpha \sum_{t=T}^1 KL(p^{\theta}_{t-1}(x_{t-1}|x_t)\| p_{t-1}(x_{t-1}|x_t))\bigg]}, \end{align}\]which recovers a standard objective in RL-based fine-tuning for diffusion models (Fan et al., 2023).
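For diffusion models, both transition kernels are typically Gaussians sharing a variance schedule, so each per-step KL collapses to a scaled squared distance between the predicted means. A minimal sketch of the resulting objective; the tensor names and shapes (`mu_theta`, `mu_ref`, `sigma`) are assumptions for illustration:

```python
# KL-regularized reward objective for Gaussian transitions
# p^theta_{t-1}(.|x_t) = N(mu_theta, sigma_t^2 I), p_{t-1}(.|x_t) = N(mu_ref, sigma_t^2 I);
# equal-covariance Gaussian KL: ||mu_theta - mu_ref||^2 / (2 sigma_t^2).
import torch

def kl_regularized_objective(rewards, mu_theta, mu_ref, sigma, alpha):
    # rewards: (batch,); mu_theta, mu_ref: (batch, T, d); sigma: (T,)
    kl = ((mu_theta - mu_ref) ** 2).sum(-1) / (2.0 * sigma ** 2)  # (batch, T)
    return (rewards - alpha * kl.sum(-1)).mean()  # maximize over theta
```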
Taking the gradient of Eq. \eqref{RL_objective} and applying the score-function (likelihood-ratio) identity approximately yields
\[\begin{align} \mathrm{E_{\{p^{\theta}_t\}_{t=T}^0}\bigg[r(x_0)\sum_{t=T}^1 \nabla \log p^{\theta}_{t-1}(x_{t-1}|x_t) -\alpha \sum_{t=T}^1 \nabla KL(p^{\theta}_{t-1}(x_{t-1}|x_t)\| p_{t-1}(x_{t-1}|x_t))\bigg]}.\label{grad_RL}\notag \end{align}\]The first term is the score-function gradient used by the classic REINFORCE algorithm, which is known to suffer from high variance.
Variance Reduction with a Value Function Baseline
Motivated by control variates and the actor-critic method, we consider a baseline \(\mathrm{V^{\theta}(x_t):= \mathbb{E}[r(x_0) \mid x_t]}\), which turns the reward into an advantage-like term:
\[\begin{align} \mathrm{E_{\{p^{\theta}_t\}_{t=T}^0}\bigg[\sum_{t=T}^1 \big(r(x_0)-V^{\theta}(x_t)\big)\nabla \log p^{\theta}_{t-1}(x_{t-1}|x_t)\bigg]}.\label{grad_RL_VR}\notag \end{align}\]The baseline leaves the gradient unbiased: conditional on $\mathrm{x_t}$, \(\mathrm{E_{p^{\theta}_{t-1}(\cdot|x_t)}\big[V^{\theta}(x_t)\nabla \log p^{\theta}_{t-1}(x_{t-1}|x_t)\big]= V^{\theta}(x_t)\nabla \int p^{\theta}_{t-1}(x_{t-1}|x_t)\,dx_{t-1}=0}\). As such, the gradient variance can be reduced significantly given a good approximator of the baseline \(\mathrm{V^{\theta}(x_t)}\).
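In practice the baseline is a learned critic, and the advantage-weighted score term is implemented as a surrogate loss whose gradient matches the estimator above. A minimal PyTorch-style sketch; the tensor names, shapes, and the stop-gradient on the critic are assumptions:

```python
# Surrogate loss whose gradient equals the variance-reduced REINFORCE estimator.
import torch

def reinforce_with_baseline_loss(log_probs, rewards, values):
    # log_probs: (batch, T) log p^theta_{t-1}(x_{t-1}|x_t) along sampled trajectories;
    # rewards: (batch,) terminal r(x_0); values: (batch, T) critic estimates V(x_t).
    advantage = (rewards.unsqueeze(-1) - values).detach()  # no grad through baseline
    return -(advantage * log_probs).sum(-1).mean()         # minimize = ascend objective
```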
Importance Sampling and Ratio Clipping
Simulating trajectories can be computationally expensive. To improve sample efficiency, we incorporate importance sampling to reuse previously collected trajectory samples:
\[\begin{align} &\quad \mathrm{E_{\{p^{\theta}_t\}_{t=T}^0}\bigg[\sum_{t=T}^1 \big(r(x_0)-V^{\theta}(x_t)\big)\nabla \log p^{\theta}_{t-1}(x_{t-1}|x_t)\bigg]}\notag \\ &\approx\mathrm{E_{\{p^{\theta_{old}}_t\}_{t=T}^0}\bigg[\sum_{t=T}^1 \big(r(x_0)-V^{\theta}(x_t)\big)\frac{p^{\theta}_{t-1}(x_{t-1}|x_t)}{p^{\theta_{old}}_{t-1}(x_{t-1}|x_t)}\nabla \log p^{\theta}_{t-1}(x_{t-1}|x_t)\bigg]}\notag \\ &=\mathrm{E_{\{p^{\theta_{old}}_t\}_{t=T}^0}\bigg[\sum_{t=T}^1 \big(r(x_0)-V^{\theta}(x_t)\big)\nabla\frac{p^{\theta}_{t-1}(x_{t-1}|x_t)}{p^{\theta_{old}}_{t-1}(x_{t-1}|x_t)}\bigg]},\notag \end{align}\]where the approximation replaces the on-policy state distribution with the old policy's, and the last step uses $\mathrm{\frac{p^{\theta}}{p^{\theta_{old}}}\nabla \log p^{\theta}=\nabla \frac{p^{\theta}}{p^{\theta_{old}}}}$. Inspired by the trust region approach in TRPO (Schulman et al., 2015), we stabilize training by clipping the importance weight within an \(\epsilon\)-bounded interval. This yields the following clipped surrogate objective:
\[\begin{align} \mathrm{E_{\{p^{\theta_{old}}_t\}_{t=T}^0}\bigg[\sum_{t=T}^1 \big(r(x_0)-V^{\theta}(x_t)\big)\nabla\, Clip\bigg(\frac{p^{\theta}_{t-1}(x_{t-1}|x_t)}{p^{\theta_{old}}_{t-1}(x_{t-1}|x_t)} , 1-\epsilon, 1+\epsilon\bigg)\bigg]}.\notag \end{align}\]This procedure closely resembles the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), which has become a standard approach in the alignment of large-scale language models.
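As a minimal sketch, the clipped surrogate can be written as a PyTorch loss over stored per-step log-probabilities; the names and shapes are illustrative, and note that PPO proper additionally takes a pointwise minimum with the unclipped term:

```python
# Clipped surrogate loss matching the objective above.
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, rewards, values, eps=0.2):
    # log_probs, old_log_probs: (batch, T) under the current and old policies;
    # rewards: (batch,); values: (batch, T) critic baseline V(x_t).
    advantage = (rewards.unsqueeze(-1) - values).detach()
    ratio = torch.exp(log_probs - old_log_probs.detach())  # importance weight
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)     # epsilon-bounded interval
    return -(advantage * clipped).sum(-1).mean()
```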
- Tzen, B., & Raginsky, M. (2019). Theoretical Guarantees for Sampling and Inference in Generative Models with Latent Diffusions. Proceedings of the 32nd Conference on Learning Theory (COLT), 3084–3114.
- Zhang, Q., & Chen, Y. (2022). Path Integral Sampler: a stochastic control approach for sampling. Proceedings of the International Conference on Learning Representations (ICLR).
- Domingo-Enrich, C., Drozdzal, M., Karrer, B., & Chen, R. T. Q. (2025). Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control. Proceedings of the International Conference on Learning Representations (ICLR).
- Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., & Lee, K. (2023). DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML), 1889–1897.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.