Draft-and-Target Sampling for Video Generation Policy

Qikang Zhang, Yingjie Lei, Wei Liu, Daochang Liu

Abstract

Video generation models have been used as robot policies that predict the future states of a task execution, conditioned on a task description and an observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel, training-free diffusion inference paradigm for video generation policies that improves inference efficiency. We introduce a self-play denoising approach that uses two complementary denoising trajectories in a single model: draft sampling takes large steps to generate a global trajectory quickly, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method achieves up to 2.1x speedup and improves the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.

Paper Structure (18 sections, 10 equations, 7 figures, 8 tables)

This paper contains 18 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The Draft-and-Target Sampling (DTS) framework. In the first stage of DTS, draft sampling generates a coarse denoising trajectory by taking large steps sequentially, which provides a fast approximation of the denoising trajectory. Target sampling then refines this trajectory in parallel by taking small steps, which yields a more precise denoising trajectory. Finally, in the verification stage, the two trajectories are compared to determine whether each token is accepted or rejected. If a token is rejected, draft sampling restarts from that position, using the corresponding target token as the new starting point.
  • Figure 2: Token Chunking. In naive DTS, the draft sampling generates a long sequence in one pass, while token chunking divides the draft trajectory into smaller chunks that are processed chunk by chunk.
  • Figure 3: Meta-World: faucet-close. In this task, the agent is required to rotate the faucet handle and close it. As shown in comparison (d), the trajectory generated by DDIM-10 differs markedly from that of DDIM-100, with the faucet rotated in an entirely different direction. As shown in comparison (e), our trajectory closely aligns with that of DDIM-100, and both successfully close the faucet.
  • Figure 4: Meta-World: Basketball. In this task, the agent needs to place the basketball into the basket. As shown in comparison (d), although the trajectory generated by DDIM-10 is close to that of DDIM-100, a ghosting artifact appears in the second-to-last frame, where an extra fragment of the robotic arm is generated near the true arm, and in the final frame the basketball is positioned slightly off-center relative to the basket, which causes the task to fail. In contrast, our trajectory closely aligns with that of DDIM-100, and both successfully place the basketball directly above the center of the basket.
  • Figure 5: Meta-World: Button-press-topdown. In this task, the agent is required to press the button from a top-down direction. As shown in comparison (d), although the trajectory generated by DDIM-10 is close to that of DDIM-100, a ghosting artifact appears in the second-to-last frame, where an extra fragment of the robotic arm is generated near the true arm, and in the final frame the button is not completely pushed down. In contrast, our trajectory closely aligns with that of DDIM-100, and both successfully press the button all the way down.
  • ...and 2 more figures
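
The draft, target, and verification stages described in the Figure 1 caption can be illustrated with a small toy sketch. This is not the authors' implementation: the scalar state, the Euler update standing in for a denoising step, the absolute-difference acceptance test, and every name below (`euler_step`, `refine`, `draft_and_target`, `tol`, `fine_steps`) are assumptions made purely for illustration. Token chunking and progressive acceptance are not modeled.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for one solver update: (x, t_from, t_to) -> x at t_to.
Step = Callable[[float, float, float], float]


def euler_step(x: float, t_from: float, t_to: float) -> float:
    """Toy stand-in for a denoising update: one Euler step of dx/dt = x."""
    return x + (t_to - t_from) * x


def refine(x: float, t_from: float, t_to: float, n: int, step: Step) -> float:
    """Target sampling for one segment: n small steps from t_from to t_to."""
    dt = (t_to - t_from) / n
    for i in range(n):
        x = step(x, t_from + i * dt, t_from + (i + 1) * dt)
    return x


def draft_and_target(
    x_start: float,
    grid: List[float],
    fine_steps: int,
    tol: float,
    step: Step = euler_step,
) -> Tuple[List[float], int]:
    """Return the accepted state at every grid point and the number of rounds."""
    states = [x_start]  # states[k] is the accepted state at time grid[k]
    rounds = 0
    while len(states) < len(grid):
        rounds += 1
        k0 = len(states) - 1

        # Draft: sequential large steps from the last accepted state.
        draft = [states[-1]]
        for k in range(k0, len(grid) - 1):
            draft.append(step(draft[-1], grid[k], grid[k + 1]))

        # Target: small steps per segment, each started from a draft state.
        # The segments are independent, so a real system could batch them.
        target = [
            refine(draft[i], grid[k0 + i], grid[k0 + i + 1], fine_steps, step)
            for i in range(len(draft) - 1)
        ]

        # Verify: accept draft states until the first mismatch, then keep the
        # target state at that position and restart drafting from it.
        for i, tgt in enumerate(target):
            if abs(draft[i + 1] - tgt) <= tol:
                states.append(draft[i + 1])
            else:
                states.append(tgt)
                break
    return states, rounds
```

In this toy, a very loose `tol` accepts the whole draft in a single round, while `tol=0.0` rejects at every position and reproduces the purely fine-step trajectory, one segment per round. Intermediate tolerances trade the number of rounds against closeness to the fine trajectory.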