LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen

TL;DR

This work proposes LaViDa-R1, a multimodal, general-purpose reasoning dLLM, which employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability.

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs. Recent works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reasoning-intensive grounding, and image editing.

Paper Structure

This paper contains 40 sections, 29 equations, 11 figures, 9 tables, 2 algorithms.

Figures (11)

  • Figure 1: We introduce LaViDa-R1, a multimodal diffusion language model with strong reasoning capabilities across diverse tasks. LaViDa-R1 incorporates a novel unified post-training framework that significantly improves upon the base model LaViDa-O [li2025lavidao] and the SFT baseline on visual math reasoning, visual question answering, image editing, and object grounding tasks.
  • Figure 2: Unified Post-Training Framework of LaViDa-R1. At each training step, a generic data engine provides prompt-response pairs $(\bm{y}^i,\bm{x}^i)$ and sample weights $A_i$, either by loading them from a dataset or by online generation. The policy model is then used to compute the log-likelihood of each sequence, $\log \pi_\theta(\bm{y}^i|\bm{x}^i)$. Finally, we optimize the proposed unified policy gradient objective.
  • Figure 3: Answer-Forcing. We initialize a partially masked sequence with the ground-truth answer injected at the end, and use the diffusion unmasking process to obtain the reasoning trace.
  • Figure 4: Tree Search. Given a base group size $N$, we first draw $N$ i.i.d. samples and evaluate their rewards. We then select the sample with the highest reward and generate $N$ new samples from an early diffusion state of that best sample. This process is repeated $K$ times. In this example, $N=4$.
  • Figure 5: Ablation Studies of Unified Objective.
  • ...and 6 more figures
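
The captions for Figures 3 and 4 describe answer-forcing and tree search only at a high level, so the following Python sketch is one possible reading rather than the paper's implementation. Every name in it (`MASK`, `answer_forced_init`, `unmask`, `early_state`, `tree_search`, `keep_ratio`) is hypothetical. The random `unmask` function stands in for the real diffusion unmasking process, and re-masking a finished sample stands in for "an early diffusion state". The sketch also assumes that the best sample is chosen from all samples drawn so far and that each of the $K$ rounds adds $N$ new samples.

```python
import random
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical mask token


def answer_forced_init(length: int, answer: Sequence[str]) -> List[str]:
    """Build a masked sequence with the ground-truth answer placed at the end."""
    if len(answer) > length:
        raise ValueError("answer is longer than the sequence")
    return [MASK] * (length - len(answer)) + list(answer)


def unmask(state: Sequence[str], vocab: Sequence[str], rng: random.Random) -> List[str]:
    """Toy stand-in for diffusion unmasking: fill each masked slot at random."""
    return [rng.choice(vocab) if tok == MASK else tok for tok in state]


def early_state(sample: Sequence[str], keep_ratio: float, rng: random.Random) -> List[str]:
    """Toy stand-in for an early diffusion state: keep a few tokens, re-mask the rest."""
    n_keep = int(len(sample) * keep_ratio)
    keep = set(rng.sample(range(len(sample)), n_keep))
    return [tok if i in keep else MASK for i, tok in enumerate(sample)]


def tree_search(
    init_state: Sequence[str],
    reward_fn: Callable[[List[str]], float],
    vocab: Sequence[str],
    n: int,
    k: int,
    keep_ratio: float = 0.25,
    seed: int = 0,
) -> List[List[str]]:
    """Draw n base samples, then expand the current best sample k times."""
    rng = random.Random(seed)
    # Base group: n independent samples from the initial state.
    pool = [unmask(init_state, vocab, rng) for _ in range(n)]
    for _ in range(k):
        # Assumption: the best sample is picked from everything drawn so far.
        best = max(pool, key=reward_fn)
        state = early_state(best, keep_ratio, rng)
        # Branch: n new samples that share the partially unmasked state.
        pool += [unmask(state, vocab, rng) for _ in range(n)]
    return pool


if __name__ == "__main__":
    # Answer-forcing: the answer tokens are fixed, only the trace is sampled.
    forced = answer_forced_init(6, ["4", "2"])
    print(unmask(forced, ["a", "b"], random.Random(0)))

    # Tree search with a toy reward that counts occurrences of "a".
    samples = tree_search([MASK] * 8, lambda s: s.count("a"), ["a", "b"], n=4, k=3)
    print(len(samples), max(s.count("a") for s in samples))
```

Under these assumptions the search returns $N + KN$ samples in total. In an RL setting, the rewards of those samples would presumably feed the sample weights $A_i$ mentioned in the Figure 2 caption, although the captions do not spell out that connection.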