OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim

TL;DR

This paper introduces Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. It also introduces an approach that combines attention-based object masks with an object-weighted SimPO loss to enhance object-specific fidelity.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.
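The abstract names an object-weighted SimPO loss but does not give its form here. As a rough illustration only, the sketch below starts from the standard SimPO objective (a length-normalized log-likelihood reward with a target margin gamma) and replaces the uniform token average with a weighted average. The per-token weights stand in for values derived from attention-based object masks; the weighting scheme, the function name, and the default beta and gamma are all assumptions made for illustration, not the paper's definition.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_mean(values, weights):
    """Weighted average of per-token log-probabilities."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, values)) / total


def object_weighted_simpo_loss(logp_chosen, logp_rejected,
                               w_chosen, w_rejected,
                               beta=2.0, gamma=1.0):
    """SimPO-style loss with per-token weights (hypothetical weighting).

    logp_*: per-token log-probabilities of the preferred / rejected image
            tokens under the current policy.
    w_*:    per-token weights; ASSUMED to be larger on tokens covered by an
            object mask. Uniform weights recover the standard SimPO loss.
    """
    reward_chosen = beta * weighted_mean(logp_chosen, w_chosen)
    reward_rejected = beta * weighted_mean(logp_rejected, w_rejected)
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    chosen = [-0.5, -3.0]      # first token assumed to lie on the object
    rejected = [-2.0, -2.0]
    uniform = object_weighted_simpo_loss(chosen, rejected, [1, 1], [1, 1])
    weighted = object_weighted_simpo_loss(chosen, rejected, [3, 1], [1, 1])
    print(uniform, weighted)
```

With uniform weights the function reduces to plain SimPO, so the weights only shift how much each token contributes to the implicit reward. Normalizing by the weight sum keeps the reward on the same scale as the length-normalized original.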

Paper Structure

This paper contains 38 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples and category distribution of (Left) preference-null and (Right) preference-false image pairs.
  • Figure 2: Overview of the OSPO framework. (Stage 1) The MLLM generates a set of base text prompts. (Stage 2) For each base prompt, the model generates multiple perturbed variants using three strategies (Replace, Swap, and Drop), and the MLLM jointly densifies each original–perturbed prompt pair. (Stage 3) The MLLM generates candidate preferred and non-preferred images from each densified prompt pair. (Stage 4) The model constructs atomic decompositional VQA questions and evaluates each candidate image’s prompt fidelity using the Self-VQA alignment score S, filtering out noisy supervision and selecting a single final image pair. (Stage 5) The MLLM is fine-tuned using preference optimization with the proposed Object-weighted SimPO loss together with SFT losses.
  • Figure 3: Qualitative examples from Janus-Pro-7B + OSPO on T2I-CompBench++, GenEval, and DPGBench prompts.
  • Figure 4: Category-wise comparison of preference-null image pairs between the (Left) Best-of-N baseline and (Right) our OSPO strategy. Lower is better.
  • Figure 5: (Left) Effect of sample size on T2I-CompBench++ (attribute) and GenEval (overall) scores. (Right) Effect of candidate image pair size on T2I-CompBench++ (attribute) and GenEval (overall) scores. Higher is better for all benchmarks.
  • ...and 1 more figure
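Stage 4 in the Figure 2 caption (score each candidate image with Self-VQA, filter noisy supervision, keep one final pair) can be sketched as follows. The caption does not specify how S is aggregated or how pairs are filtered, so this sketch makes three assumptions: S is the fraction of decomposed yes/no questions answered "yes", both candidates are scored against the original prompt's questions, and a pair is discarded when the score gap falls below a hypothetical `min_gap` threshold. The function names are illustrative and the boolean answers stand in for the MLLM's own VQA outputs.

```python
from itertools import product


def self_vqa_score(answers):
    """ASSUMED aggregation: fraction of yes/no questions answered 'yes'."""
    if not answers:
        raise ValueError("at least one VQA answer is required")
    return sum(1 for a in answers if a) / len(answers)


def select_pair(preferred_answers, rejected_answers, min_gap=0.2):
    """Pick one (preferred, rejected) candidate pair, or None if too noisy.

    preferred_answers / rejected_answers: one list of boolean VQA answers per
    candidate image. Returns (preferred_index, rejected_index, gap) for the
    pair with the largest score gap, or None when even the best gap is
    below min_gap (a hypothetical noise filter).
    """
    best = None
    indexed = product(enumerate(preferred_answers), enumerate(rejected_answers))
    for (i, pref), (j, rej) in indexed:
        gap = self_vqa_score(pref) - self_vqa_score(rej)
        if best is None or gap > best[2]:
            best = (i, j, gap)
    if best is None or best[2] < min_gap:
        return None
    return best


if __name__ == "__main__":
    preferred = [[True, True, False], [True, True, True]]
    rejected = [[True, False, False], [False, False, False]]
    print(select_pair(preferred, rejected))
```

Selecting the largest gap is only one plausible rule. A variant could instead require the preferred image to pass every question, which would trade fewer training pairs for cleaner supervision.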