EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation

Xiaofeng Tan, Wanjiang Weng, Haodong Lei, Hongsong Wang

TL;DR

The paper tackles the misalignment between pre-trained diffusion-based motion models and downstream objectives by introducing EasyTune, a step-aware fine-tuning method that optimizes at each denoising step and so avoids recursive gradient backpropagation through the full trajectory. To address the scarcity of motion preference data, it proposes Self-refinement Preference Learning (SPL), which dynamically builds preference pairs from retrieval signals to train a motion-text reward model without human annotations. Empirically, EasyTune improves alignment (MM-Dist) by 8.2% more than DRaFT-50 while requiring only 31.16% of its additional memory overhead and training 7.3x faster. The gains hold across six pre-trained diffusion-based backbones, and the paper provides ablations, user studies, and analyses of reward perception and reward-hacking mitigation. Together, these contributions offer a practical, scalable path to better semantic alignment in diffusion-based motion generation, with implications for reward design and efficient fine-tuning workflows.

Abstract

In recent years, motion generative models have advanced significantly, yet aligning them with downstream objectives remains challenging. Recent studies have shown that directly aligning diffusion models with preferences through differentiable rewards yields promising results. However, these methods suffer from (1) inefficient, coarse-grained optimization and (2) high memory consumption. In this work, we first theoretically and empirically identify the key reason for these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing (1) dense, fine-grained and (2) memory-efficient optimization. Furthermore, the scarcity of preference motion pairs limits the training of motion reward models. To this end, we further introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and conducts preference learning. Extensive experiments demonstrate that EasyTune outperforms DRaFT-50 by 8.2% in alignment (MM-Dist) improvement while requiring only 31.16% of its additional memory overhead and achieving a 7.3x training speedup. The project page is available at https://xiaofeng-tan.github.io/projects/EasyTune/index.html.
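
The idea of identifying preference pairs dynamically can be illustrated with a small sketch. It is one plausible reading, not the paper's implementation: it assumes a reward model that scores text-motion similarity, treats a failed retrieval (the ground-truth motion is not ranked first) as a preference pair, and uses a Bradley-Terry style logistic loss. The function names `mine_preference_pair` and `preference_loss`, the mining rule, and the loss are all hypothetical.

```python
import math

def mine_preference_pair(scores, gt_index):
    """Return (preferred, rejected) indices, or None if the ground truth is already ranked first.

    scores: hypothetical reward-model similarities between one text prompt and each candidate motion.
    gt_index: index of the ground-truth motion paired with that prompt.
    """
    top = max(range(len(scores)), key=lambda i: scores[i])
    if top == gt_index:
        return None  # retrieval succeeded, so this prompt yields no pair
    return gt_index, top  # assumed rule: ground truth should outrank the wrongly retrieved motion

def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style logistic loss: -log sigmoid(s_preferred - s_rejected)."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))

scores = [0.2, 0.9, 0.5]  # toy similarities for three candidate motions
pair = mine_preference_pair(scores, gt_index=2)
if pair is not None:
    preferred, rejected = pair
    loss = preference_loss(scores[preferred], scores[rejected])
```

Because pairs are mined from the reward model's own retrieval errors, the set of pairs changes as the model improves, which is the sense in which such a scheme needs no human preference annotations.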

Paper Structure

This paper contains 42 sections, 5 theorems, 54 equations, 18 figures, 15 tables, 2 algorithms.

Key Result

Corollary 1

Given the reverse process $\mathbf{x}_{t-1}^\theta = \pi_\theta(\mathbf{x}_t^\theta, t, c)$ (Eq. \ref{eq:reverse_process}), the gradient with respect to the diffusion model parameters $\theta$, denoted $\tfrac{\partial \mathbf{x}^\theta_{t-1}}{\partial \theta}$, can be expressed as:
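
Applying the chain rule to the reverse process, where $\mathbf{x}_t^\theta$ itself depends on $\theta$, gives a recursion of the following form. This is a sketch derived from the stated definition of $\pi_\theta$, not a quotation of the paper's equation:

```latex
\frac{\partial \mathbf{x}^\theta_{t-1}}{\partial \theta}
  = \frac{\partial \pi_\theta(\mathbf{x}_t^\theta, t, c)}{\partial \theta}
  + \frac{\partial \pi_\theta(\mathbf{x}_t^\theta, t, c)}{\partial \mathbf{x}_t^\theta}
    \cdot \frac{\partial \mathbf{x}_t^\theta}{\partial \theta}
```

The second term carries the dependence on every earlier denoising step, which is the recursive dependence the abstract identifies as the source of the memory and efficiency costs.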

Figures (18)

  • Figure 1: Comparison of training costs and generation performance on HumanML3D [guo2022generating]. (a) Performance comparison of different fine-tuning methods [clark2024directly, prabhudesai2023aligning, wu2025drtune]. (b) Generalization performance across six pre-trained diffusion-based models [chen2023executing, motionlcm-v2, Dai2025, tevet2023human, zhang2022motiondiffuse].
  • Figure 2: The framework of existing differentiable reward-based methods (left) and our proposed EasyTune (right). Existing methods backpropagate the gradients of the reward model through the entire denoising process, resulting in (1) excessive memory consumption, (2) inefficient optimization, and (3) coarse-grained optimization. In contrast, EasyTune optimizes the diffusion model by directly backpropagating the gradients at each denoising step, overcoming these issues.
  • Figure 3: Gradient norm with respect to denoising steps. Here, $\text{dim}(\cdot)$ denotes the gradient dimension. Detailed settings are provided in App. \ref{supp:grad_analysis}.
  • Figure 4: Similarity between the noised motion at step $t$ and the clean motion.
  • Figure 5: Core insight of EasyTune. By replacing the recursive gradient in Eq. \ref{eq:recursive_gradient} with the step-level gradient in Eq. \ref{eq:ourgradient}, EasyTune removes recursive dependencies, enabling (1) step-wise graph storage, (2) efficiency, and (3) fine-grained optimization. See App. \ref{supp:discussion} for pseudocode and discussion.
  • ...and 13 more figures
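
The contrast drawn in Figures 2 and 5 between trajectory-level and step-level gradients can be sketched on a scalar toy problem. Everything here is an illustrative assumption rather than the paper's setup: the linear step `pi` stands in for $\pi_\theta$, the quadratic reward is hypothetical, and evaluating the reward directly on each intermediate state is a simplification.

```python
def pi(x, theta, a=0.9):
    """Toy denoising step x_{t-1} = a * x_t + theta (hypothetical stand-in for pi_theta)."""
    return a * x + theta

def reward_grad(x, target=1.0):
    """Derivative with respect to x of the toy reward r(x) = -(x - target)^2."""
    return -2.0 * (x - target)

def trajectory_gradient(x_T, theta, steps, a=0.9):
    """Reward gradient through the whole chain: dx_{t-1}/dtheta = 1 + a * dx_t/dtheta."""
    x, dx_dtheta = x_T, 0.0
    for _ in range(steps):
        dx_dtheta = 1.0 + a * dx_dtheta  # recursive term couples all earlier steps
        x = pi(x, theta, a)
    return reward_grad(x) * dx_dtheta    # a single signal, available only at the end

def stepwise_gradients(x_T, theta, steps, a=0.9):
    """One gradient per step, treating the incoming x_t as a constant (stop-gradient)."""
    x, grads = x_T, []
    for _ in range(steps):
        x = pi(x, theta, a)              # with x_t detached, dx_{t-1}/dtheta = 1
        grads.append(reward_grad(x) * 1.0)
    return grads

full = trajectory_gradient(0.0, 0.1, steps=2)
per_step = stepwise_gradients(0.0, 0.1, steps=2)
```

In this toy setting the step-level version produces one update signal per denoising step and only ever needs the current state, while the trajectory-level version must carry the accumulated derivative across all steps before it yields a single gradient.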

Theorems & Definitions (9)

  • Corollary 1
  • Corollary 2
  • Corollary
  • proof
  • Theorem S1: Convergence of EasyTune
  • proof
  • Corollary S1: Asymptotic stationarity of EasyTune
  • proof
  • proof