DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

Xin Xie, Dong Gong

TL;DR

DyMO addresses the challenge of training-free diffusion-model alignment by jointly leveraging a text-aware human preference signal and a semantic alignment objective derived from attention maps. It introduces dynamic scheduling to balance these objectives across denoising steps and employs a dynamic time-travel recurrence with a Polyak-style update to improve guidance efficiency. The approach relies on an LLM-driven semantic graph to map prompts to entity-attribute structures and uses attention-based semantics to guide early-stage content, while later stages refine visuals via preference feedback. Empirical results across diverse backbones (e.g., SD V1.5, SDXL) show consistent improvements over both training-based and training-free baselines in objective metrics and human evaluations, with notable gains in layout fidelity and visual aesthetics and with favorable runtime characteristics. Overall, DyMO provides a practical, training-free pathway to align diffusion outputs with complex user preferences and semantics, expanding the applicability of high-quality, user-aligned image synthesis.
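
As a rough illustration of the two ideas above, the following toy sketch blends a semantic gradient and a preference gradient with step-dependent weights, then scales the update with the classic Polyak step size. Everything here is an assumption made for illustration: the linear schedule, the function names (`objective_weights`, `combine_gradients`, `polyak_step_size`, `guided_update`), and the use of plain Python lists in place of latents are hypothetical and are not taken from the paper, which may define its schedule and its Polyak-style update differently.

```python
def objective_weights(step, num_steps):
    """Hypothetical linear schedule (an assumption, not the paper's rule).

    The semantic weight decays from 1 to 0 while the preference weight
    grows from 0 to 1 as denoising proceeds.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    progress = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return 1.0 - progress, progress


def combine_gradients(grad_semantic, grad_preference, step, num_steps):
    """Blend the two guidance gradients using the step-dependent weights."""
    w_sem, w_pref = objective_weights(step, num_steps)
    return [w_sem * gs + w_pref * gp
            for gs, gp in zip(grad_semantic, grad_preference)]


def polyak_step_size(loss, grad, target_loss=0.0, eps=1e-8):
    """Classic Polyak step size: (f(x) - f*) / ||g||^2.

    target_loss stands in for the optimal value f*, assumed known here.
    """
    sq_norm = sum(g * g for g in grad)
    return max(loss - target_loss, 0.0) / (sq_norm + eps)


def guided_update(latent, grad, loss):
    """One gradient step on the latent, scaled by the Polyak step size."""
    eta = polyak_step_size(loss, grad)
    return [x - eta * g for x, g in zip(latent, grad)]
```

In this sketch the early steps are driven almost entirely by the semantic term and the late steps by the preference term, which mirrors the early-content, late-refinement split described above. The Polyak rule shrinks the step automatically as the loss approaches its target, so no hand-tuned guidance scale is needed in the toy setting.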

Abstract

Aligning text-to-image diffusion models is critical for bringing generated images closer to human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose DyMO, a plug-and-play, training-free method that aligns generated images with human preferences during inference. In addition to text-aware human preference scores, we introduce a semantic alignment objective that strengthens semantic alignment in the early stages of diffusion, relying on the fact that attention maps effectively reflect the semantics of noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.

Paper Structure

This paper contains 31 sections, 10 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: The framework of our method. (a) Given a user prompt, we use an LLM to identify the entities and their corresponding attributes for knowledge graph construction. We then design a semantic alignment objective that aligns cross-attention maps based on the graph; it cooperates with a pre-trained preference model to dynamically guide the denoising process for high-quality image generation. (b) The entire denoising process, shown as one-step predicted clean images under the guidance of our method.
  • Figure 2: Qualitative comparison based on SD V1.5 backbones.
  • Figure 3: Qualitative comparison based on SDXL backbones.
  • Figure 4: The case comparison of improvements between baseline model and our proposed method.
  • Figure 5: User study results.
  • ...and 10 more figures