Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

KAIST

TL;DR - DFP directly updates a one-step generative policy in action space through a drifting-field objective, avoiding ODE trajectory-level credit assignment.

Abstract

Drifting Field Policy (DFP) is a non-ODE one-step generative policy built on drifting models. It frames policy improvement as a Wasserstein-2 gradient flow toward the soft policy improvement target. The resulting drift field combines attraction toward high-Q actions with repulsion and regularization from the current or anchor policy.

Because the exact soft target is intractable, DFP replaces it with a tractable surrogate: the top-K actions selected by the critic, which keeps the method easy to implement. Experiments on Robomimic and OGBench show that DFP achieves state-of-the-art performance among one-step and ODE-based generative policies.

Method

Drifting Field Policy (DFP) turns policy improvement into a direct action-space drifting update for a one-step generative policy. Instead of learning a denoising chain or an ODE trajectory, DFP learns a pushforward map that transports noise samples into improved actions in a single step.

A. One-Step Generative Policy

$$ \pi_\theta(\cdot \mid s) = [f_\theta(\cdot, s)]_\# p_\epsilon, \qquad a = f_\theta(\epsilon, s), \qquad \epsilon \sim \mathcal{N}(0, I). $$

The policy is represented as the pushforward of a simple noise distribution through a conditional action generator. At inference time, DFP samples once and maps directly to an action, with no denoising chain or ODE solve.
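As a minimal sketch of this pushforward sampling, the snippet below uses a hypothetical affine generator in place of the learned network $f_\theta$. The names `make_affine_generator` and `sample_action` and the toy weights are illustrative assumptions, not part of DFP.

```python
import random

def make_affine_generator(state_weights, noise_scale):
    """Return a toy generator f(eps, s) = W s + noise_scale * eps.

    Hypothetical stand-in for the learned network f_theta.
    """
    def f(eps, state):
        mean = [sum(w * x for w, x in zip(row, state)) for row in state_weights]
        return [m + noise_scale * e for m, e in zip(mean, eps)]
    return f

def sample_action(f, state, action_dim, rng):
    """One-step sampling: draw eps ~ N(0, I) once and map it to an action."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    return f(eps, state)

# Toy setup (assumed): 3-dimensional state, 2-dimensional action.
W = [[0.5, 0.0, -0.5],
     [0.0, 1.0, 0.0]]
f_theta = make_affine_generator(W, noise_scale=0.1)
state = [1.0, 2.0, 3.0]
action = sample_action(f_theta, state, action_dim=2, rng=random.Random(0))
```

Sampling costs one noise draw and one forward pass through the generator; no iterative solver is involved.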

B. Wasserstein Gradient Flow Policy Improvement

$$ \pi^+(a \mid s) = \frac{\pi_{\mathrm{old}}(a \mid s)\exp(Q_\phi(s,a)/\alpha)}{Z(s)}. $$

DFP interprets policy improvement as a Wasserstein-2 gradient flow toward the soft policy improvement target. The ideal drift combines attraction toward high-Q actions with score-based repulsion and regularization around the current or anchor policy.
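For intuition, one standard way to obtain a drift of this shape is to take the functional to be the KL divergence to the target. That choice is an assumption here, since this page does not state which functional DFP uses. Under it, the Wasserstein-2 gradient flow of $\mathrm{KL}(\pi \,\|\, \pi^+)$ moves each action with velocity

$$ v(a \mid s) = \nabla_a \log \pi^+(a \mid s) - \nabla_a \log \pi(a \mid s) = \frac{1}{\alpha}\nabla_a Q_\phi(s,a) + \nabla_a \log \pi_{\mathrm{old}}(a \mid s) - \nabla_a \log \pi(a \mid s). $$

The normalizer $Z(s)$ drops out because it does not depend on $a$. The first term attracts actions toward high-Q regions, the second regularizes toward the old (anchor) policy, and the third is a score-based repulsion from the current policy's own density.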

C. Top-K Drifting Surrogate

$$ P_K(s) = \mathrm{TopK}_{a^{(j)} \sim \pi_{\mathrm{old}}(\cdot \mid s)} Q_\phi(s,a^{(j)}), \qquad \mathcal{L}_{\mathrm{top}\text{-}K} = \mathcal{L}_{\mathrm{drift}}(\theta; P_K, \pi_\theta). $$

Because the exact soft target is intractable, DFP samples candidate actions from the old policy, selects the top-K actions under the critic, and uses those actions as a practical positive set in the drifting loss.
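The selection step can be sketched as follows. This is a minimal illustration under toy assumptions: `select_top_k`, the 1-D old policy, and the quadratic critic are hypothetical stand-ins, not DFP's networks.

```python
import random

def select_top_k(state, sample_old_policy, critic, num_candidates, k):
    """Sample candidates from the old policy and keep the K with highest Q."""
    candidates = [sample_old_policy(state) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=lambda a: critic(state, a), reverse=True)
    return ranked[:k]

# Toy 1-D example (all assumed): the old policy samples around 0,
# and the critic prefers actions close to 1.
rng = random.Random(0)
old_policy = lambda s: rng.gauss(0.0, 1.0)
critic = lambda s, a: -(a - 1.0) ** 2

positives = select_top_k(state=None, sample_old_policy=old_policy,
                         critic=critic, num_candidates=32, k=4)
```

The returned actions play the role of the positive set $P_K(s)$; how they enter $\mathcal{L}_{\mathrm{drift}}$ is not shown here.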

Why Non-ODE Matters

Diffusion, flow, and MeanFlow policies learn time-indexed velocities or ODE-related objectives, so reward or top-K supervision must be absorbed through the generation trajectory. DFP instead updates the action distribution directly through a one-step pushforward map.
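To make the direct action-space update concrete, here is a rough sketch of a drift-based regression target. The drifting loss is not spelled out on this page, so the kernel mean-shift form below (attraction toward positives minus attraction toward current-policy samples), the Gaussian kernel, the bandwidth, and all function names are assumptions for illustration rather than DFP's definition.

```python
import math

def kernel_mean_shift(x, points, bandwidth):
    """Kernel-weighted average of (p - x) over points, with a Gaussian kernel."""
    sq = [sum((pi - xi) ** 2 for pi, xi in zip(p, x)) for p in points]
    # Subtract the smallest squared distance before exponentiating, for stability.
    m = min(sq)
    w = [math.exp(-(d - m) / (2.0 * bandwidth ** 2)) for d in sq]
    total = sum(w)
    return [sum(wi * (p[i] - x[i]) for wi, p in zip(w, points)) / total
            for i in range(len(x))]

def drift(x, positives, negatives, bandwidth=1.0):
    """Assumed drift: pull toward positives, push away from negatives."""
    attract = kernel_mean_shift(x, positives, bandwidth)
    repel = kernel_mean_shift(x, negatives, bandwidth)
    return [a - r for a, r in zip(attract, repel)]

def drift_target(x, positives, negatives, bandwidth=1.0):
    """Regression target x + V(x); treated as a constant when fitting the generator."""
    v = drift(x, positives, negatives, bandwidth)
    return [xi + vi for xi, vi in zip(x, v)]

# Toy 1-D actions (assumed): current-policy samples near 0, positives near 1.
generated = [[0.0], [0.2], [-0.1]]
positives = [[1.0], [1.1]]
targets = [drift_target(x, positives, generated) for x in generated]
```

In this sketch the generator output for each noise sample would be regressed onto its target, so supervision acts on the final action itself and no intermediate trajectory is involved. The drift is zero when the positive and negative sets coincide.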

Results

DFP is evaluated on 12 manipulation tasks across Robomimic and OGBench under the offline-to-online RL setting. It achieves the best average success rate (95.8%) among all compared methods, ranking first (ties included) on 9 of 12 tasks and second-best on the remaining 3.

95.8% Avg. Success
9/12 Best Tasks
+9.0 pp over QC-FQL
+15.5 pp over MVP

Main Offline-to-Online RL Results

Success rate (%) on Robomimic and OGBench tasks under the offline-to-online RL setting. Each cell reports mean ± std over 5 seeds. Best results are shown in bold; second-best results are underlined.

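| Method | Robomimic lift | Robomimic square | Robomimic can | Cube-double task2 | Cube-double task3 | Cube-double task4 | Cube-triple task2 | Cube-triple task3 | Cube-triple task4 | Cube-quadruple-100m task2 | Cube-quadruple-100m task3 | Cube-quadruple-100m task4 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BFN | 97.6±2 | 32.8±8 | 82.0±2 | 86.0±5 | 88.8±5 | 27.2±8 | 7.6±9 | 6.8±3 | 0.0±0 | 32.4±21 | 0.0±0 | 0.0±0 | 38.4 |
| QC-BFN | 99.6±1 | 88.4±4 | 90.6±3 | 99.8±0 | 99.8±0 | 92.6±6 | 87.4±10 | 80.8±4 | 33.4±9 | 95.8±2 | 63.2±10 | 74.2±11 | 83.8 |
| FQL | 96.8±2 | 10.8±7 | 58.4±8 | 93.2±8 | 91.2±5 | 6.0±6 | 0.4±1 | 6.4±8 | 0.0±0 | 0.0±0 | 0.0±0 | 0.0±0 | 30.3 |
| QC-FQL | 100.0±0 | 72.0±9 | 94.4±2 | 100.0±0 | 99.8±0 | 99.8±0 | 88.2±2 | 60.4±12 | 51.4±24 | 98.0±2 | 85.0±7 | 92.2±7 | 86.8 |
| MVP | 99.8±0 | 79.4±4 | 83.6±5 | 98.4±1 | 98.6±1 | 94.8±4 | 86.2±4 | 57.2±10 | 31.0±20 | 96.6±2 | 47.2±30 | 91.2±2 | 80.3 |
| DFP (Ours) | 100.0±0 | 93.2±2 | 90.6±3 | 100.0±0 | 99.6±1 | 99.6±1 | 98.4±1 | 91.6±2 | 81.2±6 | 99.6±1 | 96.6±2 | 99.0±2 | 95.8 |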
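The headline numbers can be recomputed from the per-task means in the table. The short check below copies those means in column order; the variable names are illustrative, the gaps are taken between the rounded averages as reported, and ties for the top score count as first place.

```python
# Per-task mean success rates (%) copied from the results table, in column order.
results = {
    "BFN":    [97.6, 32.8, 82.0, 86.0, 88.8, 27.2, 7.6, 6.8, 0.0, 32.4, 0.0, 0.0],
    "QC-BFN": [99.6, 88.4, 90.6, 99.8, 99.8, 92.6, 87.4, 80.8, 33.4, 95.8, 63.2, 74.2],
    "FQL":    [96.8, 10.8, 58.4, 93.2, 91.2, 6.0, 0.4, 6.4, 0.0, 0.0, 0.0, 0.0],
    "QC-FQL": [100.0, 72.0, 94.4, 100.0, 99.8, 99.8, 88.2, 60.4, 51.4, 98.0, 85.0, 92.2],
    "MVP":    [99.8, 79.4, 83.6, 98.4, 98.6, 94.8, 86.2, 57.2, 31.0, 96.6, 47.2, 91.2],
    "DFP":    [100.0, 93.2, 90.6, 100.0, 99.6, 99.6, 98.4, 91.6, 81.2, 99.6, 96.6, 99.0],
}

# Average success rate per method, rounded to one decimal as in the table.
averages = {m: round(sum(v) / len(v), 1) for m, v in results.items()}

# Gaps in percentage points between the rounded averages.
gap_qc_fql = round(averages["DFP"] - averages["QC-FQL"], 1)
gap_mvp = round(averages["DFP"] - averages["MVP"], 1)

# Number of tasks where DFP attains the top mean (ties included).
num_tasks = len(results["DFP"])
first_place = sum(
    1 for i in range(num_tasks)
    if results["DFP"][i] == max(v[i] for v in results.values())
)
```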

Full Offline + Online Training Curves

Success rate over offline pretraining and online fine-tuning phases across Robomimic and OGBench tasks.

Qualitative Rollouts

Representative DFP rollouts on Robomimic and OGBench manipulation tasks.

Robomimic Square: Pick-and-place manipulation
Cube-double: Two-cube rearrangement
Cube-triple: Three-cube rearrangement
Cube-quadruple: Long-horizon four-cube rearrangement

BibTeX

@article{koo2026driftingfieldpolicy,
  title={Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow},
  author={Koo, Juil and Park, Mingue and Choi, Jiwon and Min, Yunhong and Sung, Minhyuk},
  journal={arXiv preprint},
  year={2026}
}