FlowRL can be derived from first principles: RLHF-style reward maximization is recast as reverse-KL distribution matching against a reward-weighted target, which yields a tractable sequence-level trajectory-balance (TB) loss with a learned partition function, plus practical stabilizers for long chain-of-thought training. The result is a simple squared "flow residual" objective: minimizing it makes sampling proportional to reward (modulated by a reference model), and it connects directly to GFlowNet trajectory-balance guarantees for mode coverage.

Problem setup

Target distribution view
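Following the summary above, the target is a reward-tilted version of the reference model. A sketch of the definition (β is the reward scale and Z the normalizer; both appear in the loss code at the end of this article):

```latex
\pi^{*}(y \mid x) \;=\; \frac{\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(\beta\, r(x, y)\big)}{Z(x)},
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(\beta\, r(x, y)\big).
```

High-reward responses are upweighted relative to the reference model, but never outside its support.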

Reverse KL objective
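Expanding the reverse KL against this target gives (a sketch, using the definitions above):

```latex
D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi^{*}\big)
= \mathbb{E}_{y \sim \pi_\theta}\!\big[\log \pi_\theta(y \mid x) - \beta\, r(x, y) - \log \pi_{\mathrm{ref}}(y \mid x)\big] + \log Z(x).
```

Since log Z(x) does not depend on θ, minimizing this is equivalent to maximizing E[β r] − KL(π_θ ∥ π_ref), i.e. standard KL-regularized reward maximization.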

Gradient of the reverse KL
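Differentiating the expectation above and using the score-function identity E_{π_θ}[∇_θ log π_θ] = 0 (which kills the log Z(x) term and the extra ∇_θ log π_θ term), one obtains a sketch of the form:

```latex
\nabla_\theta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi^{*}\big)
= \mathbb{E}_{y \sim \pi_\theta}\!\Big[
\big(\log \pi_\theta(y \mid x) - \beta\, r(x, y) - \log \pi_{\mathrm{ref}}(y \mid x)\big)\,
\nabla_\theta \log \pi_\theta(y \mid x)
\Big].
```

The weight on each score term is exactly the (unshifted) flow residual that the TB loss below squares.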

From reverse KL to trajectory balance
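Setting the reverse-KL integrand to zero for every response y gives the sequence-level trajectory-balance condition, which characterizes the optimum π_θ = π*:

```latex
Z(x)\,\pi_\theta(y \mid x) \;=\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(\beta\, r(x, y)\big)
\;\;\Longleftrightarrow\;\;
\log Z(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y) - \log \pi_{\mathrm{ref}}(y \mid x) = 0.
```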

Sequence-level TB loss
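Relaxing the TB condition into a squared penalty over on-policy samples, with the intractable Z(x) replaced by a learned per-prompt estimate Z_φ(x) (the `partition_net` in the code at the end), gives:

```latex
\mathcal{L}_{\mathrm{TB}}(\theta, \phi)
= \mathbb{E}_{x,\; y \sim \pi_\theta}\!\Big[
\big(\log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y) - \log \pi_{\mathrm{ref}}(y \mid x)\big)^{2}
\Big].
```

The loss is zero iff the TB condition holds on the sampled support, so both the policy and the partition estimate are trained by the same residual.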

Length normalization
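Summed sequence log-probabilities grow linearly with response length, so long chain-of-thought rollouts would dominate the squared residual. Dividing both sequence log-probabilities by the response length |y| (the tensor `T` in the code) keeps the residual on a per-token scale:

```latex
\delta(x, y) \;=\; \log Z_\phi(x) \;+\; \frac{\log \pi_\theta(y \mid x)}{|y|} \;-\; \beta\,\hat{r}(x, y) \;-\; \frac{\log \pi_{\mathrm{ref}}(y \mid x)}{|y|}.
```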

Reward standardization
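Raw rewards can sit on very different scales across prompts, which would make a single β inconsistent. Standardizing within each prompt's group of sampled responses (exactly the z-scoring in the loss code at the end) fixes the scale; a minimal sketch:

```python
import torch

def standardize_rewards(reward: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Z-score rewards within each prompt's group of sampled responses.

    reward: [num_prompts, group_size] raw scalar rewards.
    Returns r_hat with per-group mean ~0 and std ~1, so a single beta
    has a consistent scale across prompts with different reward ranges.
    """
    mean = reward.mean(dim=1, keepdim=True)
    std = reward.std(dim=1, keepdim=True)
    return (reward - mean) / (std + eps)
```

The `eps` in the denominator guards against degenerate groups where every sample gets the same reward.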

Off-policy correction
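Rollouts usually come from a slightly stale behavior policy. Weighting each sequence's squared residual by a detached, PPO-style clipped importance ratio (matching the `rho`/`w` computation in the loss code at the end) keeps the loss approximately on-policy without letting any one stale sample dominate. A minimal sketch:

```python
import torch

def clipped_importance_weights(logp_cur: torch.Tensor,
                               logp_old: torch.Tensor,
                               eps: float = 0.2) -> torch.Tensor:
    """Detached, clipped per-sequence importance weight.

    logp_cur / logp_old: summed sequence log-probs under the current and
    behavior policies. The weight scales the loss but receives no gradient.
    """
    with torch.no_grad():
        rho = torch.exp(logp_cur - logp_old)
        return torch.clamp(rho, 1.0 - eps, 1.0 + eps)
```

Because the weight is computed under `no_grad`, it acts as a fixed per-sample coefficient rather than a PPO surrogate objective.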

Final FlowRL loss
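Putting the pieces together (length-normalized residual δ, standardized reward r̂, detached clipped weight w), the final objective matches the loss code at the end of this article:

```latex
\mathcal{L}_{\mathrm{FlowRL}}(\theta, \phi)
= \mathbb{E}\!\left[\, w(y)\,\delta(x, y)^{2} \,\right],
\qquad
w(y) = \mathrm{clip}\!\left(\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{old}}(y \mid x)},\; 1-\epsilon,\; 1+\epsilon\right)
\;\text{(detached)},
```

with δ(x, y) = log Z_φ(x) + log π_θ(y|x)/|y| − β r̂(x, y) − log π_ref(y|x)/|y|.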

Gradients in detail
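Differentiating the weighted squared residual (with w and r̂ detached) gives, sketched term by term:

```latex
\nabla_\theta \mathcal{L} = \mathbb{E}\!\Big[\, 2\, w(y)\, \delta(x, y)\, \tfrac{1}{|y|}\, \nabla_\theta \log \pi_\theta(y \mid x) \,\Big],
\qquad
\nabla_\phi \mathcal{L} = \mathbb{E}\!\Big[\, 2\, w(y)\, \delta(x, y)\, \nabla_\phi \log Z_\phi(x) \,\Big].
```

The policy is pushed up on sequences with δ < 0 (under-weighted relative to their tilted reward) and down when δ > 0, while log Z_φ(x) regresses toward the value that balances the group's residuals.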

Why squared TB matches reverse KL
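An informal one-line argument, mirroring the known equivalence between on-policy TB gradients and reverse-KL gradients in the GFlowNet literature (length normalization and the clipped weight omitted; the gradient through the sampling distribution dropped):

```latex
\tfrac{1}{2}\,\mathbb{E}_{y \sim \pi_\theta}\!\big[\nabla_\theta\, \delta(x, y)^{2}\big]
= \mathbb{E}_{y \sim \pi_\theta}\!\big[\delta(x, y)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]
= \nabla_\theta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi^{*}\big).
```

The last step uses E_{π_θ}[∇_θ log π_θ] = 0, so any y-independent offset in δ, including the error in the learned log Z_φ(x), does not bias the policy gradient.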

Relation to GFlowNets

Toy example
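A tiny runnable sanity check, with all names illustrative: on a three-"sequence" problem, gradient descent on the squared flow residual should drive the policy toward π_ref · exp(β r) / Z, with the learned log-partition converging alongside it.

```python
import torch

torch.manual_seed(0)
beta = 1.0
r = torch.tensor([0.0, 1.0, 2.0])            # rewards for 3 candidate answers
logp_ref = torch.log(torch.full((3,), 1/3))  # uniform reference model

logits = torch.zeros(3, requires_grad=True)  # policy over the 3 answers
log_z = torch.zeros(1, requires_grad=True)   # learned partition estimate
opt = torch.optim.Adam([logits, log_z], lr=0.05)

for _ in range(2000):
    logp_cur = torch.log_softmax(logits, dim=0)
    delta = log_z + logp_cur - beta * r - logp_ref  # flow residual
    loss = (delta ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

target = torch.softmax(beta * r + logp_ref, dim=0)  # pi* computed exactly
learned = torch.softmax(logits, dim=0)
print(learned, target)  # the two distributions should nearly coincide
```

Because the target is exactly representable here, the loss can reach (near) zero, at which point log_z matches logsumexp(β r + log π_ref) and the policy matches π*.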

Implementation recipe

Practical stabilizers, justified

Contrast with PPO/GRPO

Minimal PyTorch-style loss

import torch

# Shapes: for each prompt x we sample a group of G responses.
# reward, T (response lengths), and all sequence log-probs have shape [B, G];
# logp_*_seq are summed token log-probs. partition_net maps x to one
# log-partition estimate per prompt, shape [B, 1], broadcast over the group.
# Hyperparams: beta (reward scale), eps (clip range).

with torch.no_grad():
    # group-wise reward z-scoring: consistent beta scale across prompts
    r_hat = (reward - reward.mean(dim=1, keepdim=True)) / (reward.std(dim=1, keepdim=True) + 1e-6)

# length normalization: per-token average log-prob, so long CoT rollouts
# do not dominate the squared residual
logp_cur_norm = logp_cur_seq / T
logp_ref_norm = logp_ref_seq / T

with torch.no_grad():
    # detached, clipped importance weight for slightly off-policy samples
    rho = torch.exp(logp_cur_seq - logp_old_seq)
    w = torch.clamp(rho, 1 - eps, 1 + eps)

logZ = partition_net(x)  # learned log-partition, one value per prompt, [B, 1]
delta = logZ + logp_cur_norm - beta * r_hat - logp_ref_norm  # flow residual
loss = (w * delta ** 2).mean()
loss.backward()  # trains both the policy (via logp_cur_norm) and partition_net (via logZ)

This implements length-normalized, sequence-level trajectory balance with a learned partition function and detached, clipped importance weighting, as specified in FlowRL.

Takeaways

References: the FlowRL arXiv paper for derivations, implementation details, and ablations; the Trajectory Balance (GFlowNet) paper for the theoretical guarantee behind proportional sampling via TB.