Self-Refining Video Sampling

1KAIST    2NYU    3NTU Singapore    4DeepAuto.ai

*Equal Contribution,      Equal Advising

ArXiv 2026

[TL;DR] We present a self-refining video sampling method that reuses a pre-trained video generator as a denoising autoencoder to iteratively refine latents. With ~50% additional NFEs, it improves physical realism (e.g., motion coherence and physics alignment) without any external verifier, training, or dataset.


Methods


Flow Matching as Denoising Autoencoder

We revisit the connection between diffusion models and denoising autoencoders (DAEs) [1-2], and extend it to interpret flow matching as a DAE from a training-objective perspective. In particular, up to a time-dependent weighting, the flow-matching loss is equivalent to a standard DAE [3] reconstruction objective.

Flow matching loss:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; z_1 \sim p_{\mathrm{data}},\; z_0 \sim \mathcal{N}(0, I)}\Big[\, \big\| v_\theta(z_t, t) - (z_1 - z_0) \big\|_2^2 \,\Big], \qquad z_t = t\, z_1 + (1 - t)\, z_0$$
DAE loss:
$$\mathcal{L}_{\mathrm{DAE}}(\theta) = \mathbb{E}_{t,\; z_1,\; z_0}\Big[\, \tfrac{1}{(1 - t)^2}\, \big\| \hat{z}_1 - z_1 \big\|_2^2 \,\Big], \qquad \hat{z}_1 = z_t + (1 - t)\, v_\theta(z_t, t),$$
for which the model learns to denoise the corrupted input $z_t$ back to the clean sample $z_1$.

Thus, a flow-matching-trained model can be used as a time-conditioned DAE. At inference time, we repurpose the pre-trained model as a DAE, i.e., a self-refiner that iteratively refines samples toward the data manifold.
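
As a concrete illustration, below is a minimal PyTorch-style sketch of this reuse, assuming a velocity-prediction interface `velocity_model(z_t, t, cond)` and the interpolation convention $z_t = t\,z_1 + (1-t)\,z_0$; the function name and signature are illustrative assumptions, not the API of any particular video generator.

import torch

def predict_endpoint(velocity_model, z_t, t, cond=None):
    """Reuse a pre-trained flow-matching model as a time-conditioned DAE.

    Given a noisy latent z_t at noise level t (t = 1 is clean data, t = 0 is
    pure noise under z_t = t * z_1 + (1 - t) * z_0), return the predicted
    clean endpoint z1_hat = z_t + (1 - t) * v_theta(z_t, t).
    """
    v = velocity_model(z_t, t, cond)      # predicted velocity field v_theta(z_t, t)
    z1_hat = z_t + (1.0 - t) * v          # one-shot denoising prediction of z_1
    return z1_hat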

$$\textbf{Predict:}\quad \hat{z}_1^{(k)} = z_t^{(k)} + (1 - t)\, v_\theta\big(z_t^{(k)}, t\big), \qquad \textbf{Perturb:}\quad z_t^{(k+1)} = t\, \hat{z}_1^{(k)} + (1 - t)\, \epsilon^{(k)}, \quad \epsilon^{(k)} \sim \mathcal{N}(0, I)$$
From a DAE perspective, Predict acts as a denoiser that maps a noisy state $z_t$ toward a clean sample, while Perturb injects noise by interpolating with Gaussian noise at level $t$. Together, they form a simple corrupt–reconstruct loop for DAE-style refinement.
$$z_t^{(0)} = z_t, \qquad z_t^{(k+1)} = \big(\mathrm{Perturb}_t \circ \mathrm{Predict}_t\big)\big(z_t^{(k)}\big), \quad k = 0, \dots, K - 1, \qquad z_t^{*} := z_t^{(K)}$$
At a fixed noise level $t$, we iteratively apply them to refine $z_t$, an operation we call Predict-and-Perturb (P&P). Repeating this process for a few iterations $K$ gradually steers latents toward higher-density regions (i.e., the data manifold).
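
A minimal sketch of this refinement loop, reusing the hypothetical `predict_endpoint` helper above, might look as follows; the default of K=3 iterations follows the 2–3 iterations reported below.

def predict_and_perturb(velocity_model, z_t, t, K=3, cond=None):
    """Predict-and-Perturb (P&P) refinement at a fixed noise level t.

    Each iteration (i) Predicts the clean endpoint from the current state and
    (ii) Perturbs it back to noise level t by interpolating with fresh
    Gaussian noise: z_t <- t * z1_hat + (1 - t) * eps.
    """
    for _ in range(K):
        z1_hat = predict_endpoint(velocity_model, z_t, t, cond)  # Predict
        eps = torch.randn_like(z_t)
        z_t = t * z1_hat + (1.0 - t) * eps                       # Perturb
    return z_t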


Figure (a): Concept of self-refining video sampling. Within the same noise level, the video latent $z_t$ is refined as the predicted endpoint $\hat{z}_1$ is pulled toward the data manifold.
Figure (b): Sampling comparison on a 2D synthetic dataset. (a-b) P&P generates samples closer to the data manifold than the Euler solver. (c-d) At a fixed timestep, iterative P&P pulls the prediction $\hat{z}_1$ closer to the data manifold.

In practice, only 2–3 iterations are sufficient to improve temporal coherence and physical plausibility. The refined state $z_t^*:= z^{(K)}_{t}$ is then used as the updated state and passed to the next ODE step, enabling plug-and-play integration with existing solvers by simply replacing $z_t$ with $z_t^*$.
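
For concreteness, here is a hedged sketch of how this refinement can be dropped into a plain Euler sampling loop; the timestep schedule, the choice of which steps to refine, and the conditioning handling are illustrative assumptions rather than the exact setup used in the paper.

def sample_with_pnp(velocity_model, z, timesteps, refine_steps=(), K=3, cond=None):
    """Euler sampling with plug-and-play P&P refinement at selected steps.

    `timesteps` increases from ~0 (noise) to 1 (data) under the convention
    z_t = t * z_1 + (1 - t) * z_0; at the indices listed in `refine_steps`,
    the state z_t is replaced by its refined version z_t^* before the next
    ODE step, leaving the solver itself unchanged.
    """
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        if i in refine_steps:
            z = predict_and_perturb(velocity_model, z, t, K=K, cond=cond)  # z_t -> z_t^*
        v = velocity_model(z, t, cond)
        z = z + (t_next - t) * v          # Euler step along the learned velocity field
    return z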


[1] Pascal Vincent, A Connection Between Score Matching and Denoising Autoencoders, Neural Computation, 2011
[2] Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS, 2019
[3] Bengio et al., Generalized Denoising Auto-Encoders as Generative Models, NeurIPS, 2013


Uncertainty-aware Predict-and-Perturb

However, we observe that multiple P&P updates ($K \geq 3$) with classifier-free guidance can lead to over-saturation or over-simplification in static regions such as the background. To address this, we propose Uncertainty-aware P&P, which selectively refines only locally uncertain regions. Specifically, we estimate the uncertainty of the prediction, $\mathbf{U}=\|\hat{z}_1^{(k)}-\hat{z}_1^{(k+1)}\|_1$, which measures how sensitive the prediction is to the Perturb step.


Here, we apply the P&P update only to regions where the uncertainty $\mathbf{U}$ exceeds a predefined threshold $\tau$. Notably, this requires no additional model evaluations (NFEs), since both predictions are already computed during the P&P process. Please refer to the paper for more details.
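
A minimal sketch of this masked update is shown below; the per-location reduction over the channel dimension, the tensor layout [B, C, T, H, W], and the way the mask is broadcast are illustrative assumptions, and the two endpoint predictions are the ones already produced by consecutive P&P iterations.

def uncertainty_aware_perturb(z_t, z1_hat_prev, z1_hat_curr, t, tau):
    """One uncertainty-aware P&P update (no extra model evaluations).

    U = ||z1_hat^(k) - z1_hat^(k+1)||_1, computed per spatial/temporal
    location over channels, measures how sensitive the prediction is to the
    Perturb step; the re-noised state replaces z_t only where U > tau, so
    confident (static) regions such as the background stay untouched.
    """
    # Per-location uncertainty map (assumes layout [B, C, T, H, W]).
    U = (z1_hat_prev - z1_hat_curr).abs().sum(dim=1, keepdim=True)
    mask = (U > tau).to(z_t.dtype)                    # 1 where refinement is applied

    eps = torch.randn_like(z_t)
    z_t_refined = t * z1_hat_curr + (1.0 - t) * eps   # Perturb the latest prediction
    return mask * z_t_refined + (1.0 - mask) * z_t    # keep low-uncertainty regions fixed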

Results


Motion Enhanced Video Generation with Wan2.2-A14B T2V

Wan2.2-A14B already generates human motion reasonably well, but our method still improves it substantially.

Robotics Image-to-Video Generation with Cosmos-Predict2.5-2B

Our method is also applicable to Image-to-Video generation. Applied to Cosmos-Predict2.5-2B, it reduces common robotics artifacts such as unstable grasps and implausible interactions, and produces more consistent motion than rejection sampling with Cosmos-Reason1-7B (best-of-4). This may be useful for downstream tasks such as vision-language-action (VLA) models, where even small artifacts can significantly affect perception and action.

Table: PAI-Bench-G evaluation results on robotics I2V generation. Grasp and Robot-QA are measured by Gemini 3 Flash.

Additional Results

See more results in our paper!

Citation


@article{jang2026selfrefining,
    title={Self-Refining Video Sampling}, 
    author={Sangwon Jang and Taekyung Ki and Jaehyeong Jo and Saining Xie and Jaehong Yoon and Sung Ju Hwang},
    year={2026},
    journal={arXiv preprint arXiv:2601.18577},
}
            

Acknowledgement


This page is based on REPA.