SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation

Leigang Qu, Haochuan Li, Wenjie Wang, Xiang Liu, Juncheng Li, Liqiang Nie, Tat-Seng Chua

TL;DR

SILMM introduces a model-agnostic, self-improving framework for aligning large multimodal models to compositional text-to-image prompts without external feedback. It combines a five-step iterative loop with two alignment strategies: discrete DPO for token-based representations and Kernel-based Continuous DPO (KC-DPO) for continuous visual features, enhanced by a DropDiv diversification mechanism. A decompositional self-questioning strategy and VQA-based self-feedback enable self-assessed, self-guided improvement, leading to substantial gains on three compositional T2I benchmarks (e.g., >30% on T2I-CompBench++ and ~20% on DPG-Bench). Empirical results on DreamLLM and SEED-LLaMA demonstrate both the generality and effectiveness of SILMM, with ablations highlighting the importance of diversification, question-driven feedback, and kernel choices in KC-DPO. The work indicates a scalable path toward autonomous improvement of LMMs in multimodal generation tasks and lays groundwork for further efficiency and capability enhancements.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (SILMM) that enables LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations; however, it is less suitable for LMMs with continuous visual features, as obtaining generation probabilities is challenging. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
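
To make the role of generation probabilities concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, not the authors' implementation. The function names, the `beta` default, and the use of pre-computed sequence-level log-probabilities are assumptions for illustration. It shows why discrete visual tokens suit DPO: the loss needs log-probabilities of the chosen and rejected outputs under both the policy and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities, e.g. summed over the
    discrete visual tokens of an image (an assumption for this sketch).
    """
    # Implicit reward of each sample: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)
```

When the policy equals the reference, the loss is log 2; it falls as the policy raises the chosen sample's probability relative to the rejected one. A model that emits a single deterministic feature vector offers no such log-probabilities, which is the gap the kernel-based variant is meant to address.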

Paper Structure

This paper contains 22 sections, 15 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Illustration of (a) text-image misalignment in compositional prompts and (b) comparison of discrete and continuous LMMs for T2I. Given a prompt, discrete LMMs can sample diverse token sequences from categorical distributions, while continuous LMMs can only produce a single deterministic feature vector. Note that the input learnable embeddings are optional for some continuous LMMs (Sun et al., 2024).
  • Figure 2: Schematic illustration of SILMM, comprising five steps: 1) LMMs generate compositional prompts by sampling based on provided instructions. 2) Diverse representations and images are generated using either discrete nucleus sampling or the proposed continuous DropDiv. 3) LMMs divide each compositional prompt into semantic units and generate questions for each unit. 4) VQA is conducted to answer these questions, with the answers and likelihoods aggregated into alignment scores as self-feedback. 5) For alignment tuning, DPO is applied for discrete LMMs, while the proposed KC-DPO is used for continuous LMMs.
  • Figure 3: Performance improvement of iterative alignment tuning based on SEED-LLaMA and DreamLLM, across 8 detailed categories of T2I-CompBench++. Iter. 0 denotes the base models without alignment tuning.
  • Figure 4: Overall alignment scores of SEED-LLaMA with discrete DPO and DreamLLM with continuous KC-DPO on T2I-CompBench++ with (a) varying numbers of generated prompts in the training data, and (b) different numbers of preference pairs sampled from 30 diverse generated images per prompt. $N \times N$ means we select the top-$N$ and bottom-$N$ images from the 30 generated ones as the chosen and rejected samples, respectively.
  • Figure 5: Comparison of four methods for diverse continuous representation generation, with alignment scores evaluated on the validation set of T2I-CompBench++. For each prompt, DreamLLM generates ten diverse representations and corresponding images.
  • ...and 10 more figures
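
The self-feedback and pair-selection steps described in the Figure 2 and Figure 4 captions can be sketched as follows. This is an illustrative reading, not the paper's code. The helper names are hypothetical, averaging per-question "yes" probabilities is an assumed aggregation rule, and pairing every chosen image with every rejected image is one interpretation of the $N \times N$ notation.

```python
from itertools import product
from typing import List, Sequence, Tuple


def alignment_score(yes_probs: Sequence[float]) -> float:
    """Aggregate per-question VQA 'yes' probabilities into one score.

    A plain mean is an assumption; the captions only say that answers
    and likelihoods are aggregated.
    """
    if not yes_probs:
        raise ValueError("need at least one question")
    return sum(yes_probs) / len(yes_probs)


def select_preference_pairs(scores: Sequence[float],
                            n: int) -> List[Tuple[int, int]]:
    """Return (chosen_index, rejected_index) pairs for one prompt.

    The top-n scored images are treated as chosen and the bottom-n as
    rejected; each chosen image is paired with each rejected image,
    giving n * n pairs.
    """
    if n <= 0 or 2 * n > len(scores):
        raise ValueError("n must satisfy 0 < 2 * n <= len(scores)")
    # Image indices sorted from highest to lowest alignment score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:n]
    rejected = ranked[-n:]
    return list(product(chosen, rejected))


# Usage: five images for one prompt, scored by self-feedback.
scores = [0.2, 0.9, 0.5, 0.1, 0.7]
pairs = select_preference_pairs(scores, n=2)
```

With 30 images per prompt, as in Figure 4(b), `n` controls the trade-off between the number of pairs and how far apart the chosen and rejected scores are.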