Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou
TL;DR
UniDFlow narrows the gap between multimodal understanding and generation by combining discrete flow matching with a frozen vision-language backbone and lightweight adapters. Training proceeds in three stages (Stage I: Text Alignment; Stage II: Vision Alignment; Stage III: Reference-Based Multimodal Preference Alignment). The method introduces Mixture-of-LoRA Routing to dynamically compose adapters, paired with Time-Step Guided RMSNorm to preserve pretrained priors. A reference-anchored Direct Preference Optimization (mRef-DPO) aligns text and image outputs under identical conditioning, improving faithfulness and controllability in generation and editing. Across eight benchmarks, UniDFlow achieves state-of-the-art results, shows strong zero-shot editing and generation capabilities, and remains parameter-efficient by training adapters rather than retraining the full model.
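To make the idea of composing adapters on top of a frozen backbone concrete, the following is a minimal sketch of a gated mixture of LoRA updates applied to a single frozen linear layer. It is an illustration under stated assumptions, not the UniDFlow implementation: the softmax gating, the function name `mixture_of_lora_forward`, the `scale` parameter, and the choice to pass router logits in directly are all hypothetical, and the paper's actual router and its Time-Step Guided RMSNorm are not modeled here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def mixture_of_lora_forward(x, W_frozen, adapters, router_logits, scale=1.0):
    """Compute y = W x + scale * sum_k g_k * B_k (A_k x).

    W_frozen      : frozen base weight, shape (d_out, d_in); never updated.
    adapters      : list of (A_k, B_k) low-rank pairs, A_k is (r, d_in),
                    B_k is (d_out, r).
    router_logits : one score per adapter (assumed to come from some router);
                    g = softmax(router_logits) gives the mixing weights.
    """
    gates = softmax(router_logits)
    y = matvec(W_frozen, x)  # contribution of the frozen backbone
    for g, (A, B) in zip(gates, adapters):
        delta = matvec(B, matvec(A, x))  # low-rank update B_k A_k x
        y = [y_i + scale * g * d_i for y_i, d_i in zip(y, delta)]
    return y

# Toy example: identity backbone and two rank-1 adapters with equal gate weight.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # adapter 1 acts on the first coordinate
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # adapter 2 acts on the second coordinate
]
y = mixture_of_lora_forward(x, W, adapters, router_logits=[0.0, 0.0])
```

Because only the small `A_k` and `B_k` matrices (and whatever produces the router logits) would be trained, the base weight stays untouched, which is the sense in which adapter-based training is parameter-efficient.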
Abstract
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
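As a rough illustration of what "optimizing relative outcomes under identical conditioning" can look like, the sketch below implements the standard Direct Preference Optimization loss for one preference pair, where both candidates share the same prompt and a frozen reference model anchors the comparison. This is the generic DPO objective, shown as an assumption about the general family the method belongs to; it is not the paper's mRef-DPO formulation, and the function name `dpo_loss` and the default `beta` are hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single (preferred, rejected) pair.

    logp_w, logp_l         : policy log-likelihoods of the preferred and
                             rejected outputs given the SAME conditioning.
    ref_logp_w, ref_logp_l : the same quantities under a frozen reference model.
    beta                   : strength of the implicit KL-style regularization.
    """
    # How much more the policy favors the preferred output than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy matches the reference, the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the preferred output's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -6.0, -5.0, -5.0)
```

The loss depends only on differences between policy and reference log-likelihoods, so it rewards relative improvement on the preferred output instead of absolute likelihood.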
