Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

Onkar Susladkar; Tushar Prakash; Gayatri Deshmukh; Kiet A. Nguyen; Jiaxun Zhang; Adheesh Juvekar; Tianshu Bao; Lin Chai; Sparsh Mittal; Inderjit S Dhillon; Ismini Lourentzou

TL;DR

UniDFlow addresses the gap between multimodal understanding and generation by building a discrete flow-matching framework on a frozen vision-language backbone with lightweight adapters. The method uses a three-stage training pipeline (Stage I: Text Alignment; Stage II: Vision Alignment; Stage III: Reference-Based Multimodal Preference Alignment). It introduces Mixture-of-LoRA Routing to compose adapters dynamically and uses Time-Step Guided RMSNorm to preserve pretrained priors. A reference-anchored Direct Preference Optimization objective (mRef-DPO) aligns text and image outputs under identical conditioning, improving faithfulness and controllability in generation and editing. UniDFlow achieves state-of-the-art results across eight benchmarks, shows strong zero-shot editing and generation capabilities, and stays parameter-efficient by training adapters rather than retraining the full model.
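To make the idea of composing adapters on a frozen backbone concrete, here is a minimal sketch of a generic soft mixture of LoRA adapters. It is not the paper's routing design: the function `mixture_of_lora_forward`, the linear softmax router, and all shapes are assumptions made for illustration, and timestep conditioning is left out entirely.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_lora_forward(x, w_frozen, adapters, router_w):
    """Frozen linear layer plus a gated sum of low-rank updates.

    x        : input vector of size d_in
    w_frozen : frozen weight, d_out x d_in (never updated)
    adapters : list of (a_down, b_up) pairs; a_down is r x d_in, b_up is d_out x r
    router_w : hypothetical router weight, num_adapters x d_in
    """
    out = matvec(w_frozen, x)              # frozen backbone path
    gates = softmax(matvec(router_w, x))   # one gate per adapter, sums to 1
    for gate, (a_down, b_up) in zip(gates, adapters):
        delta = matvec(b_up, matvec(a_down, x))  # low-rank update B(Ax)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out, gates


# Toy example: identity backbone, two rank-1 adapters, uniform router.
x = [1.0, 2.0]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[0.0], [0.0]]),  # zero up-projection: contributes nothing
    ([[0.0, 1.0]], [[1.0], [0.0]]),  # adds x[1] to the first output coordinate
]
router_w = [[0.0, 0.0], [0.0, 0.0]]  # equal scores, so both gates are 0.5
out, gates = mixture_of_lora_forward(x, w_frozen, adapters, router_w)
```

Only the adapter and router weights would be trainable in a setup like this, which is where the parameter efficiency of adapter-based methods comes from.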

Abstract

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
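For readers unfamiliar with preference alignment, the sketch below shows the standard Direct Preference Optimization loss for a single preferred/rejected pair scored under the same conditioning. It illustrates what "optimizing relative outcomes" means in general and is not the paper's mRef-DPO objective; the function names, the scalar log-probability inputs, and the default `beta` are assumptions for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : log-probabilities under the model being trained
    ref_logp_* : log-probabilities under a frozen reference model
    beta       : strength of the implicit penalty for drifting from the reference
    """
    chosen_margin = logp_chosen - ref_logp_chosen        # gain on the preferred output
    rejected_margin = logp_rejected - ref_logp_rejected  # gain on the rejected output
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# When the trained model matches the reference, the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the preferred output's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The loss depends only on the difference between the two margins, so the model is rewarded for ranking the preferred output above the rejected one relative to the reference, not for raising absolute likelihoods.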

Paper Structure (21 sections, 9 equations, 22 figures, 9 tables)

Introduction
Related Work
Method
Preliminaries: Discrete Flow Matching
UniDFlow
Time-Step Guided RMSNorm
Stage I: Text Alignment
Stage II: Vision Alignment
Stage III: Reference-Based Multimodal Preference Alignment
Experiments
Multi-Modal Understanding
Text-to-Image Generation
Text-to-Image Editing
Ablations
Conclusion
...and 6 more sections

Figures (22)

Figure 1: We propose UniDFlow, a unified multimodal diffusion framework that supports image understanding, generation, and thinking-based editing. The model performs visual reasoning for question answering, produces high-quality text-to-image generations across diverse scenes and subjects, and enables instruction-driven, multi-step image editing through structured reasoning.
Figure 2: Instruction-guided editing attention maps showing UniDFlow more precisely focuses on relevant regions than prior models.
Figure 3: Overview of Stage I (understanding via text alignment) and Stage II (generation via vision alignment) of UniDFlow.
Figure 4: Stage III of UniDFlow: reference-based multimodal preference alignment for improved faithfulness, controllability, and editing.
Figure 5: Multimodal reasoning from UniDFlow.
...and 17 more figures
