Flow Matching: A Minimal Guide
Learning vector fields in continuous and discrete spaces
Flow Matching (FM) (Lipman et al., 2023) has seen impressive success in image and video generation. The key idea is to train a neural network to learn the underlying vector (velocity) field that deterministically transports particles from a simple prior distribution to the data distribution.
Continuous State Space
Problem setup
Consider a vector field \(\mathrm{v_t : \mathbb{R}^d \to \mathbb{R}^d}\) and define its flow map \(\mathrm{\phi_t}\) as the solution of the ODE
\[\begin{equation} \mathrm{\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \quad \phi_0(x) = x}.\notag \end{equation}\]Let \(\mathrm{p_0}\) and \(\mathrm{p_1}\) be the prior and data distributions, respectively. The marginal probability \(\mathrm{p_t:=[\phi_t]_\# p_0}\) is defined as the pushforward of \(\mathrm{p_0}\) under \(\mathrm{\phi_t}\). Equivalently, by the change-of-variables formula:
\[\begin{equation} \mathrm{p_t(x) = p_0(\phi_t^{-1}(x))\, \big| \det \nabla \phi_t^{-1}(x) \big|.}\notag \end{equation}\]The marginal probability \(\mathrm{p_t}\) satisfies the continuity equation (see Theorem 1 of Chen et al., 2018):
\[\begin{equation} \mathrm{\frac{d}{d t} p_t(x) = - \nabla_x \cdot \big( v_t(x)\, p_t(x) \big).}\notag \end{equation}\]To approximate the true vector field \(\mathrm{v_t(x)}\) via a parameterized model \(\mathrm{v_{\theta}(t, x)}\), the ideal regression loss is:
\[\begin{equation} \mathrm{\mathcal{L}_{\mathrm{FM}}(\theta)= \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x \sim p_t}\big[ \| v_\theta(t,x) - v_t(x) \|^2 \big]}.\notag \end{equation}\]This is, however, intractable, since both \(\mathrm{p_t}\) and \(\mathrm{v_t}\) are unknown.
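Before making the loss tractable, it is worth noting what the flow map buys us at inference time: once a velocity model \(\mathrm{v_\theta}\) is trained, sampling reduces to integrating the ODE above from \(\mathrm{t=0}\) to \(\mathrm{t=1}\). A minimal Euler sketch, where `velocity(t, x)` is a hypothetical stand-in for a trained \(\mathrm{v_\theta}\):

```python
import torch

def euler_sample(velocity, x0, num_steps=100):
    """Integrate dx/dt = velocity(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt)  # current time, one entry per sample
        x = x + dt * velocity(t, x)              # Euler step along the learned field
    return x

# usage: samples = euler_sample(trained_model, torch.randn(64, 2))
```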
Conditional Flow Matching (CFM)
To make the loss more tractable, we introduce a conditioning variable \(\mathrm{x_1\sim p_1}\) and define a conditional flow map \(\mathrm{\psi_t(\cdot \mid x_1)}\) along with a conditional vector field \(\mathrm{v_t(x \mid x_1)}\) satisfying
\[\begin{align} \mathrm{p_t(\cdot\mid x_1)} &\mathrm{= [\psi_t(\cdot \mid x_1)]_\# p_0,}\label{map_psi} \\ \mathrm{\frac{\mathrm{d}}{\mathrm{d}t} \psi_t(x \mid x_1)} &\mathrm{= v_t\big( \psi_t(x \mid x_1) \mid x_1 \big), \quad \psi_0(x \mid x_1) = x.}\label{flow_eqn} \end{align}\]The unconditional velocity field can then be expressed as a conditional expectation:
\[\begin{equation} \mathrm{v_t(x)=\int v_t(x \mid x_1) \dfrac{p_t(x\mid x_1)\, p_1(x_1)}{p_t(x)}dx_1}.\notag \end{equation}\]Recall from Eq.\eqref{map_psi} that \(\mathrm{\psi_t(\cdot\mid x_1)}\) pushes the prior \(\mathrm{p_0}\) forward to \(\mathrm{p_t(\cdot\mid x_1)}\). We can now define the tractable CFM loss:
\[\begin{align} \mathrm{\mathcal{L}_{\mathrm{CFM}}(\theta)}&\ = \ \ \ \mathrm{\mathbb{E}_{t,\, x_1\sim p_1,\ x \sim p_t(\cdot|x_1)}\big[ \| v_\theta(t,x) - v_t(x\mid x_1) \|^2 \big]} \notag\\ &\mathrm{\overset{\text{Eq.}\eqref{map_psi}}{=}\mathbb{E}_{t,\, x_1\sim p_1,\, x_0 \sim p_0}\Big[\big\|v_\theta\big(t, \psi_t(x_0 \mid x_1)\big)- v_t\big( \psi_t(x_0 \mid x_1) \mid x_1 \big)\big\|^2\Big]}\notag \\ &\mathrm{\overset{\text{Eq.}\eqref{flow_eqn}}{=}\mathbb{E}_{t,\, x_1\sim p_1,\, x_0 \sim p_0}\Big[\big\|v_\theta\big(t, \psi_t(x_0 \mid x_1)\big)- \frac{\mathrm{d}}{\mathrm{d}t} \psi_t(x_0 \mid x_1)\big\|^2\Big].} \notag \\ \end{align}\]Although \(\mathrm{\mathcal{L}_{\mathrm{CFM}}}\) differs from \(\mathrm{\mathcal{L}_{\mathrm{FM}}}\) by a constant independent of \(\mathrm{\theta}\), their gradients coincide, so regressing \(\mathrm{v_{\theta}}\) onto the conditional vector field trains exactly the same velocity model.
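To see why, note that both losses expand into the same \(\mathrm{\theta}\)-dependent terms, since \(\mathrm{v_t(x)=\mathbb{E}\big[v_t(x\mid x_1)\mid x\big]}\) (this is the standard argument from Lipman et al., 2023):
\[\begin{align} \mathrm{\mathcal{L}_{\mathrm{FM}}(\theta)-\mathcal{L}_{\mathrm{CFM}}(\theta)=\mathbb{E}_{t,\, x\sim p_t}\big[\|v_t(x)\|^2\big]-\mathbb{E}_{t,\, x_1\sim p_1,\, x\sim p_t(\cdot\mid x_1)}\big[\|v_t(x\mid x_1)\|^2\big]}, \notag \end{align}\]which is a constant independent of \(\mathrm{\theta}\), so \(\mathrm{\nabla_\theta \mathcal{L}_{\mathrm{CFM}}=\nabla_\theta \mathcal{L}_{\mathrm{FM}}}\).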
Special Flow Maps
Consider a map \(\mathrm{\psi_t(x\mid x_1)} \mathrm{=\sigma_t(x_1) x + \mu_t(x_1)}\). Taking the time derivative and combining with Eq.\eqref{flow_eqn}, we have
\[\begin{align} \mathrm{\dfrac{d }{dt}\psi_t(x\mid x_1)} &\mathrm{=\sigma_t'(x_1) x + \mu_t'(x_1)=v_t\big( \psi_t(x \mid x_1) \mid x_1 \big)}. \notag \\ \end{align}\]Writing \(\mathrm{x}\) in place of \(\mathrm{\psi_t(x \mid x_1)}\) and inverting the affine map, so that the pre-image is \(\mathrm{\psi_t^{-1}(x \mid x_1)=\dfrac{x-\mu_t(x_1)}{\sigma_t(x_1)}}\), we obtain
\[\begin{align} \mathrm{v_t\big(x \mid x_1 \big)} &\mathrm{=\sigma_t'(x_1) \bigg(\dfrac{x-\mu_t(x_1)}{\sigma_t(x_1)}\bigg) + \mu_t'(x_1)}. \label{vector_field_formula} \\ \end{align}\]Connections to Diffusion Models: For the VE-SDE (Song et al., 2021), the conditional probability path is \(\mathrm{p_t(x\mid x_1)=N(x\mid x_1, \sigma_{1-t}^2I)},\) and the conditional vector field follows from Eq.\eqref{vector_field_formula} as \(\mathrm{v_t(x\mid x_1)=-\frac{\sigma_{1-t}'}{\sigma_{1-t}}(x-x_1)}\). For the VP-SDE, the conditional probability path is \(\mathrm{p_t(x \mid x_1) = N\!\left( x \,\middle|\, \alpha_{1-t} x_1,\, (1 - \alpha_{1-t}^2) I \right)}\), where \(\mathrm{\alpha_t = e^{-\tfrac{1}{2}\int_0^t \beta(s)\,ds}}\); the corresponding vector field can be derived in the same way.
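As a quick sanity check, Eq.\eqref{vector_field_formula} is a one-liner in code. The sketch below uses illustrative function and schedule names (not from any library) and verifies that plugging in the schedule \(\mathrm{\mu_t(x_1)=tx_1,\ \sigma_t=1-t}\) recovers \(\mathrm{(x_1-x)/(1-t)}\), the field used in the next paragraph.

```python
import torch

def affine_conditional_velocity(x, x1, t, mu, dmu, sigma, dsigma):
    """Conditional velocity of the affine path psi_t(x | x1) = sigma_t(x1) * x + mu_t(x1),
    following the formula: dsigma * (x - mu) / sigma + dmu."""
    return dsigma(t, x1) * (x - mu(t, x1)) / sigma(t, x1) + dmu(t, x1)

# Assumed schedule: mu_t(x1) = t * x1, sigma_t(x1) = 1 - t
x, x1, t = torch.randn(4, 2), torch.randn(4, 2), torch.rand(4, 1)
v = affine_conditional_velocity(
    x, x1, t,
    mu=lambda t, x1: t * x1, dmu=lambda t, x1: x1,
    sigma=lambda t, x1: 1.0 - t, dsigma=lambda t, x1: -1.0,
)
assert torch.allclose(v, (x1 - x) / (1.0 - t), atol=1e-5)  # matches (x1 - x)/(1 - t)
```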
Connections to Optimal Transport (OT): Consider the OT flow map:
\[\begin{equation}\boxed{\mathrm{\psi_t(x\mid x_1)=(1-t)x+tx_1}},\notag\end{equation}\]which corresponds to the displacement interpolation \(\mathrm{p_t=[(1-t)\,id+t\,\psi]_\# p_0}\). The conditional vector field follows as \(\mathrm{v_t(x\mid x_1)=\dfrac{x_1-x}{1-t}}\), and the CFM loss simplifies to
\[\begin{align} \boxed{\mathrm{\mathcal{L}_{\mathrm{CFM-OT}}(\theta)=\mathbb{E}_{t,\, x_1\sim p_1,\, x_0 \sim p_0}\Big[\big\|v_\theta\big(t, \psi_t(x_0 \mid x_1)\big)- (x_1-x_0)\big\|^2\Big]}}. \notag \\ \end{align}\]
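This is the loss most practical FM implementations optimize. A minimal PyTorch-style training sketch is given below; `model(t, x)` is a hypothetical stand-in for \(\mathrm{v_\theta}\), taking the time and the interpolated sample and returning a velocity of the same shape as the data.

```python
import torch

def cfm_ot_loss(model, x1):
    """One CFM-OT training objective: regress v_theta(t, x_t) onto x1 - x0.  x1: (B, d) data batch."""
    x0 = torch.randn_like(x1)                          # prior sample x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], 1, device=x1.device)   # t ~ U[0, 1], one per sample
    xt = (1.0 - t) * x0 + t * x1                       # psi_t(x0 | x1) = (1 - t) x0 + t x1
    target = x1 - x0                                   # d/dt psi_t(x0 | x1)
    return ((model(t, xt) - target) ** 2).mean()       # squared-error regression

# usage: loss = cfm_ot_loss(model, batch); loss.backward(); optimizer.step()
```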
Network Parameterization
In diffusion models, noise prediction (Song et al., 2021) parameterizes the network to predict the added noise, while data prediction (Karras et al., 2022) parameterizes it to recover the clean data \(\mathrm{x_1}\) directly; under an affine path, either quantity determines the velocity \(\mathrm{v_t}\). The parameterizations are mathematically equivalent, but the data-prediction (denoiser) view often makes training more stable and easier to control.
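Concretely, under the OT path above, a data-prediction network \(\mathrm{\hat{x}_1(t, x)}\) (one possible parameterization, shown here for illustration) induces the velocity
\[\begin{equation} \mathrm{v_\theta(t, x)=\dfrac{\hat{x}_1(t, x)-x}{1-t}},\notag \end{equation}\]which mirrors the conditional field \(\mathrm{v_t(x\mid x_1)=\dfrac{x_1-x}{1-t}}\) with the unknown \(\mathrm{x_1}\) replaced by its prediction.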
Discrete State Spaces
Problem setup
Consider a sequence \(\mathrm{x=(x^1, \cdots, x^d)\in \mathcal{S}}\), where \(\mathcal{S}\) is the discrete state space, each \(\mathrm{x^i\in \mathcal{V}}\) is a token (or coordinate), and \(\mathcal{V}=\mathrm{\{1, 2, \cdots, V\}}\) is the vocabulary. We define the sequence-level delta \(\mathrm{\delta(x,y)=1}\) if \(\mathrm{x=y}\) and 0 otherwise; the same notation is used at the token level, \(\mathrm{\delta(x^i, y^i)}\), for \(\mathrm{x^i, y^i\in \mathcal{V}}\).
Continuous-time Markov Chain (CTMC)
A CTMC \(\mathrm{(X_t)_{0\leq t\leq 1}}\) is characterized by its transition kernel \(\mathrm{p_{t+h\mid t}}\), defined as
\[\begin{align} &\mathrm{p_{t+h\mid t}(y\mid x):=\mathbb{P}(X_{t+h}=y\mid X_t=x)=\delta(y, x)+h v_t(y, x)+o(h), \quad \mathbb{P}(X_0=x)=p_0(x)}, \notag \end{align}\]where \(\mathrm{v_t(y, x)}\) denotes the transition rate or velocity field from state \(\mathrm{x\in \mathcal{S}}\) to \(\mathrm{y\in \mathcal{S}}\):
\[\begin{align} &\mathrm{v_t(y, x)\geq 0 \text{ for all } y\neq x, \text{ and } \sum_y v_t(y, x)=0.} \label{u_limit} \end{align}\]The marginal probability \(\mathrm{p_t}\) of \(\mathrm{(X_t)_{0\leq t\leq 1}}\) satisfies the Kolmogorov forward equation
\[\begin{align} \mathrm{\dfrac{d}{dt}p_t(y)}&\mathrm{=\sum_x v_t(y, x)p_t(x)} \label{v_def} \\ &\mathrm{=\sum_{x\neq y} v_t(y, x)p_t(x) + v_t(y, y)p_t(y)} \notag \\ &\mathrm{\overset{\eqref{u_limit}}{=}\underbrace{\sum_{x\neq y} v_t(y, x)p_t(x)}_{\text{incoming flux}}-\underbrace{\sum_{x\neq y} v_t(x, y)p_t(y)}_{\text{outgoing flux}}}. \notag \end{align}\]
State Transitions with At Most One Token
Note that naïve transitions from \(\mathrm{x}\) to all possible states \(\mathrm{y}\) result in a huge output dimension \({\mathrm{\textbf{V}^d}}\). We therefore introduce factorized velocities that only allow transitions affecting at most one token:
\[\begin{align} &\mathrm{v_t(y, x)=\sum_i \delta(y^{\bar i}, x^{\bar i}) v_t^i(y^i, x)}, \label{factor_u} \end{align}\]where \(\mathrm{\bar i=(1, \cdots, i-1, i+1, \cdots, d)}\). It thus suffices to model \(\mathrm{v_t^i(y^i, x)}\) instead of \(\mathrm{v_t(y, x)}\) and the modeling complexity is significantly reduced from \({\mathrm{\textbf{V}^d}}\) to \({\mathrm{\textbf{V}d}}\).
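For instance, with a vocabulary of \(\mathrm{V=1000}\) tokens and sequence length \(\mathrm{d=100}\), a full transition rate over states would have \(\mathrm{1000^{100}}\) entries per input, whereas the factorized rates require only \(\mathrm{1000\times 100=10^5}\) outputs.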
Given \(\mathrm{X_0\sim p_0}\) and the factorized paths and velocities, we can sample \(\mathrm{X_t}\) using the Euler method
\[\begin{align} \mathrm{\mathbb{P}(X_{t+h}=y\mid X_t=x)}&=\mathrm{\delta(y, x)+h v_t(y, x) +o(h)} \notag \\ &\overset{\eqref{factor_u}}{=}\mathrm{\delta(y, x)+h \sum_i \delta(y^{\bar i}, x^{\bar i}) v_t^i(y^i, x) + o(h)} \notag \\ &=\prod_i \bigg[\mathrm{\delta(y^i, x^i)+h\, v_t^i(y^i, x)}\bigg] + \mathrm{o(h)}, \notag \end{align}\]where the last equality follows from \(\mathrm{\prod_i (a^i + hb^i)=\prod_i a^i + h\sum_i \big(\prod_{j\neq i} a^j\big)b^i + o(h)}\).
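Concretely, one Euler step updates each token independently from its own \(\mathrm{V}\)-dimensional transition distribution. A minimal sketch, assuming a hypothetical `rates(t, x)` callable that returns the factorized velocities \(\mathrm{v_t^i(\cdot, x)}\) as a (batch, d, V) tensor whose rows sum to zero:

```python
import torch
import torch.nn.functional as F

def euler_step_discrete(rates, x, t, h):
    """One factorized Euler step of the CTMC.

    x: (B, d) long tensor with tokens in {0, ..., V-1}
    rates(t, x): assumed callable returning v_t^i(., x) as a (B, d, V) tensor with
                 nonnegative off-diagonal entries and rows summing to zero over V.
    """
    v = rates(t, x)                                                      # (B, d, V)
    probs = F.one_hot(x, num_classes=v.shape[-1]).to(v.dtype) + h * v    # delta(y, x) + h * v
    probs = probs.clamp_min(0.0)
    probs = probs / probs.sum(-1, keepdim=True)                          # guard against too-large h
    return torch.distributions.Categorical(probs=probs).sample()         # next tokens, (B, d)
```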
Conditional Velocity
Analogous to the continuous case, Gat et al. (2024) introduced a conditional velocity field such that
\[\begin{align} \mathrm{v_t(y, x)=\sum_{x_0,x_1} v_t(y, x\mid x_0, x_1)p_{0, 1\mid t}(x_0, x_1 \mid x)=\mathbb{E}[v_t(y, X_t\mid X_0, X_1)\mid X_t=x]}, \notag \\ \end{align}\]where \(\mathrm{p_{0, 1\mid t}(x_0, x_1\mid x)=\dfrac{p_{t\mid 0, 1}(x\mid x_0, x_1) p_{X_0, X_1}(x_0, x_1)}{p_t(x)}}\).
The factorized conditional path assumes that
\[\begin{align} \mathrm{p_{t\mid 0, 1} (x\mid x_0, x_1)}&\mathrm{=\prod_i p_{t\mid 0, 1}^i (x^i \mid x_0, x_1)}, \notag \\ \end{align}\]where each \(\mathrm{p_{t\mid 0, 1}^i (x^i \mid x_0, x_1)}\) follows an interpolation schedule \(\mathrm{(\kappa_t)_{t\in[0, 1]}}\) with \(\mathrm{\kappa_0=0}\) and \(\mathrm{\kappa_1=1}\):
\[\begin{align} &\boxed{\mathrm{p_{t\mid 0, 1}^i (x^i \mid x_0, x_1)=(1-\kappa_t)\delta(x^i, x_0^i)+\kappa_t \delta(x^i, x_1^i)}}. \label{mixture_entry} \end{align}\]A random variable \(\mathrm{X_t^i\sim p_{t\mid 0, 1}^i}\) follows
\[\begin{align} \boxed{ \mathrm{ X_t^i = (1 - B_t)\,x_0^i + B_t\,x_1^i, \qquad B_t \sim \mathrm{Bernoulli}(\kappa_t). } } \notag \end{align}\]The conditional marginal probability \(\mathrm{p_{t\mid 0,1}^i}\) then evolves as (Lipman et al., 2024)
\[\begin{align} \mathrm{\dfrac{d}{dt} p_{t\mid 0, 1}^i(y^i\mid x_0, x_1)}&\overset{\eqref{mixture_entry}}{=}\mathrm{\dot{\kappa}_t\bigg[\delta(y^i, x_1^i)-\delta(y^i, x_0^i)\bigg]} \notag \\ &\overset{\eqref{mixture_entry}}{=}\mathrm{\dot{\kappa}_t\bigg[\delta(y^i, x_1^i)-\dfrac{p_{t\mid 0, 1}^i(y^i\mid x_0, x_1)-\kappa_t\delta(y^i, x_1^i)}{1-\kappa_t}\bigg]} \notag \\ &=\mathrm{\frac{\dot{\kappa}_t}{1-\kappa_t}\bigg[\delta(y^i, x_1^i)-p_{t\mid 0, 1}^i(y^i\mid x_0, x_1)\bigg]} \notag \\ &=\mathrm{\sum_{x^i}\frac{\dot{\kappa}_t}{1-\kappa_t}\bigg[\delta(y^i, x_1^i)-\delta(y^i, x^i)\bigg]p_{t\mid 0,1}^i(x^i\mid x_0, x_1)}. \notag \\ \end{align}\]Matching this with Eq.\eqref{v_def}, the conditional velocity follows as
\[\begin{align} \boxed{\mathrm{v_t^i(y^i, x^i\mid x_0, x_1)=\frac{\dot{\kappa}_t}{1-\kappa_t}\bigg[\delta(y^i, x_1^i)-\delta(y^i, x^i)\bigg]}}. \notag \\ \end{align}\]The conditional velocity, however, depends on \(\mathrm{x_1}\), which is unavailable at sampling time; recovering a probability-preserving marginal velocity therefore requires additionally learning a posterior such as \(\mathrm{p_{1\mid t}^i}\) (Lipman et al., 2024), as discussed next.
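Both boxed formulas translate directly into code. A minimal sketch, where `x0`, `x1`, `xt`, `y` are integer token tensors of the same shape and `kappa_t`, `dkappa_t` are assumed scalar values of the schedule and its derivative at time \(\mathrm{t}\):

```python
import torch

def sample_xt(x0, x1, kappa_t):
    """Per-token mixture sample: X_t^i = x1^i with probability kappa_t, else x0^i."""
    take_x1 = torch.rand(x0.shape) < kappa_t     # B_t ~ Bernoulli(kappa_t)
    return torch.where(take_x1, x1, x0)

def conditional_velocity(y, xt, x1, kappa_t, dkappa_t):
    """v_t^i(y^i, x_t^i | x_0, x_1) = dkappa / (1 - kappa) * [delta(y, x1) - delta(y, xt)]."""
    return dkappa_t / (1.0 - kappa_t) * ((y == x1).float() - (y == xt).float())
```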
Velocity Parameterization
Since the conditional velocity above depends on the conditioning variables \(\mathrm{(x_0, x_1)}\) only through \(\mathrm{x_1^i}\), the marginal factorized velocity simplifies:
\[\begin{align} \mathrm{v_t^i(y^i, x)}&=\mathrm{\sum_{x_0,x_1} v_t^i(y^i, x^i\mid x_0, x_1)p_{0, 1\mid t}(x_0, x_1 \mid x)} \notag \\ &=\mathrm{\sum_{x_1^i} v_t^i(y^i, x^i\mid x_1^i)p^i_{1\mid t}(x_1^i \mid x)}, \notag \\ \end{align}\]where \(\mathrm{p^i_{1\mid t}(x_1^i \mid x)=\sum_{x_0, x_1^{\bar i}} p_{0, 1\mid t}(x_0, x_1 \mid x)=\mathbb{E}\big[\delta(x_1^i, X_1^i)\mid X_t=x\big]}\).
Hence, one may learn \(\mathrm{v_t^i(y^i, x)}\) via a parameterized posterior model \(\mathrm{p_{1\mid t}^{\theta, i}(x_1^i\mid x)}\), i.e., a data prediction. A suitable conditional matching loss is
\[\begin{align} \mathrm{\mathcal{L}_{CM}(\theta)=\mathbb{E}_{t, X_0, X_1, X_t} D_{X_t}\bigg(\delta(\cdot, X_1^i), p_{1\mid t}^{\theta, i}(\cdot \mid X_t)\bigg)}, \notag \\ \end{align}\]where \(\mathrm{D}\) is a divergence between distributions over \(\mathcal{V}\), with the cross-entropy being the common choice (Lipman et al., 2024).
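With the cross-entropy choice, this amounts to training a per-token classifier that predicts \(\mathrm{x_1^i}\) from the corrupted sequence \(\mathrm{X_t}\). A minimal sketch, assuming a hypothetical `model(t, xt)` returning per-token logits of shape (batch, d, V) and, for illustration, a linear schedule \(\mathrm{\kappa_t=t}\):

```python
import torch
import torch.nn.functional as F

def discrete_fm_loss(model, x0, x1, kappa=lambda t: t):
    """Conditional matching loss with cross-entropy: predict x1^i from the mixed sequence X_t."""
    B, d = x1.shape
    t = torch.rand(B, 1)                                 # t ~ U[0, 1], one per sequence
    take_x1 = torch.rand(B, d) < kappa(t)                # B_t ~ Bernoulli(kappa_t), per token
    xt = torch.where(take_x1, x1, x0)                    # mixture-path sample X_t
    logits = model(t, xt)                                # (B, d, V) data-prediction logits
    return F.cross_entropy(logits.transpose(1, 2), x1)   # averaged over batch and positions

# usage: loss = discrete_fm_loss(model, x0, x1); loss.backward(); optimizer.step()
```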
Citation
@misc{deng2025flowmatching,
title ={{Flow Matching: A Minimal Guide}},
author ={Wei Deng},
journal ={waynedw.github.io},
year ={2025},
howpublished = {\url{https://www.weideng.org/posts/flow_matching/}}
}
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR.
- Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T. Q., Lopez-Paz, D., Ben-Hamu, H., & Gat, I. (2024). Flow Matching Guide and Code.
- Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural Ordinary Differential Equations. NeurIPS.
- Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR.
- Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS.
- Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T. Q., Synnaeve, G., Adi, Y., & Lipman, Y. (2024). Discrete Flow Matching. NeurIPS.