Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou

TL;DR

UniDFlow addresses the gap between multimodal understanding and generation by unifying discrete diffusion with a frozen vision-language backbone and lightweight adapters. The method uses a three-stage training pipeline (Stage I Text Alignment, Stage II Vision Alignment, Stage III Reference-Based Multimodal Preference Alignment) and introduces Mixture-of-LoRA Routing to dynamically compose adapters, guided by Time-Step Guided RMSNorm to preserve pretrained priors. A reference-anchored Direct Preference Optimization (mRef-DPO) aligns text and image outputs under identical conditioning, yielding improved faithfulness and controllability in generation and editing. Across eight benchmarks, UniDFlow achieves state-of-the-art results, demonstrates strong zero-shot editing and generation capabilities, and maintains parameter efficiency by training adapters rather than retraining the full model.
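
To make the idea of composing adapters on top of a frozen backbone concrete, here is a minimal NumPy sketch of a single linear layer with a softmax-gated mixture of low-rank (LoRA) updates. It is an illustration under assumptions, not the paper's implementation: the class name `MixtureOfLoRALinear`, the input-conditioned softmax router, the number of adapters, the rank, and the zero initialisation of the `B` matrices are all hypothetical choices, and the Time-Step Guided RMSNorm is not modelled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MixtureOfLoRALinear:
    """Frozen linear layer plus a gated mixture of low-rank adapters.

    Hypothetical sketch: the routing rule and shapes are assumptions,
    not details taken from the paper.
    """

    def __init__(self, d_in, d_out, num_adapters=3, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (never updated in this sketch).
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))
        # Low-rank factors, one (A, B) pair per adapter.
        self.A = rng.normal(0.0, 0.02, (num_adapters, d_in, rank))
        # B starts at zero so every adapter is initially a no-op.
        self.B = np.zeros((num_adapters, rank, d_out))
        # Router maps each input to one logit per adapter.
        self.router = rng.normal(0.0, 0.02, (d_in, num_adapters))

    def gates(self, x):
        """Per-example mixing weights over adapters, shape (batch, num_adapters)."""
        return softmax(x @ self.router, axis=-1)

    def __call__(self, x):
        base = x @ self.W  # frozen path
        # Low-rank update from each adapter: shape (batch, num_adapters, d_out).
        delta = np.einsum("bi,kir,kro->bko", x, self.A, self.B)
        # Weighted sum of adapter updates using the router gates.
        return base + np.einsum("bk,bko->bo", self.gates(x), delta)


layer = MixtureOfLoRALinear(d_in=8, d_out=6)
x = np.random.default_rng(1).normal(size=(5, 8))
y = layer(x)
```

Because the `B` factors start at zero in this sketch, the layer initially reproduces the frozen path exactly, which is one common way to keep pretrained behaviour intact at the start of adapter training.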

Abstract

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
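
For orientation, "optimizing relative outcomes under identical conditioning" can be read against the standard Direct Preference Optimization objective, written out here as background. This is the generic DPO loss, not the paper's mRef-DPO formulation, which may differ in how the reference and the multimodal outputs enter. The symbols are assumptions introduced for illustration: $c$ is the shared conditioning, $y^{+}$ and $y^{-}$ are the preferred and dispreferred outputs, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(c,\,y^{+},\,y^{-})}
    \left[
      \log \sigma\!\left(
        \beta \left(
          \log \frac{\pi_\theta(y^{+} \mid c)}{\pi_{\mathrm{ref}}(y^{+} \mid c)}
          - \log \frac{\pi_\theta(y^{-} \mid c)}{\pi_{\mathrm{ref}}(y^{-} \mid c)}
        \right)
      \right)
    \right]
```

Both outputs are scored under the same conditioning $c$, so the loss depends only on how much more the trained model favours $y^{+}$ over $y^{-}$ than the reference model does.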

Paper Structure

This paper contains 21 sections, 9 equations, 22 figures, 9 tables.

Figures (22)

  • Figure 1: We propose UniDFlow, a unified multimodal diffusion framework that supports image understanding, generation, and thinking-based editing. The model performs visual reasoning for question answering, produces high-quality text-to-image generations across diverse scenes and subjects, and enables instruction-driven, multi-step image editing through structured reasoning.
  • Figure 2: Attention maps for instruction-guided editing, showing that UniDFlow focuses on the relevant regions more precisely than prior models.
  • Figure 3: Overview of Stage I (understanding via text alignment) and Stage II (generation via vision alignment) of UniDFlow.
  • Figure 4: Stage III of UniDFlow: reference-based multimodal preference alignment for improved faithfulness, controllability, and editing.
  • Figure 5: Multimodal reasoning from UniDFlow.
  • ...and 17 more figures