Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

Abinav Rao, Sujan Rachuri

Abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
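The ln(2) observation can be checked against the standard DPO objective, which is the negative log-sigmoid of a scaled implicit-reward margin. The sketch below is a minimal illustration and not the authors' training code; the function name `dpo_loss` and the value `beta=0.1` are assumptions chosen for the example. It shows that a margin of zero, meaning the policy separates chosen from rejected samples no better than the reference model does, gives a loss of exactly ln(2).

```python
import math

def dpo_loss(margin, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    `margin` is the implicit reward gap
        [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)].
    `beta=0.1` is an illustrative assumption, not a value taken from the paper.
    """
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); branch to avoid overflow for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Zero margin: the loss sits at ln(2), the value it takes before any preference is learned.
print(round(dpo_loss(0.0), 4), round(math.log(2), 4))

# A positive margin (chosen preferred over rejected) pushes the loss below ln(2).
print(dpo_loss(5.0) < math.log(2))
```

A loss that stays near ln(2) therefore indicates that the margin is staying near zero, which is consistent with the abstract's reading that the generation preference signal is not being fit.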

Paper Structure (107 sections, 10 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: Janus-Pro architecture and gradient imbalance. The understanding path (blue) encodes images via SigLIP into visual tokens and produces text responses (${\sim}$30--100 tokens), while the generation path (orange) decodes 576 VQ tokens through a VQ-VAE. Both paths share the LLM backbone (gray). During multi-task DPO, generation gradients $\bm{g}_G$ are ${\sim}$14$\times$ larger than understanding gradients $\bm{g}_U$ (dashed arrows, thickness proportional to magnitude), drowning out the understanding signal in the shared parameters.
  • Figure 2: Gradient diagnostics for Janus-Pro-7B (200 mini-batches). Left: Cosine similarity between understanding and generation gradients. Values center at zero, confirming that the gradients are near-orthogonal and occupy separate subspaces. Right: Magnitude ratio $\|\bm{g}_U\| / \|\bm{g}_G\|$. Generation gradients are ${\sim}$14$\times$ larger, starving the understanding signal under equal weighting. Janus-Pro-1B shows qualitatively identical magnitude imbalance with slight anti-alignment (mean cosine $= -0.003$).
  • Figure 3: Training dynamics of gradient-weighted Balanced DPO (Janus-Pro-7B, 1,000 steps). Top-left: Gradient cosine similarity remains near zero throughout, confirming persistent orthogonality. Bottom-left: Dynamic weights shift sharply at step 50 ($w_U$: $0.5 \to {\sim}0.80$), reaching ${\sim}0.93$ by step 200 and stabilizing. Top-right: Task-specific DPO losses; generation loss has higher variance due to discrete VQ token similarities. Bottom-right: Combined loss.
  • Figure 4: Per-layer gradient cosine similarity across all 30 transformer layers (Janus-Pro-7B). lora_B cosines range from $-0.01$ to $+0.01$ with no systematic depth-dependent trend, confirming that orthogonality is uniform across all depths.
  • Figure 5: Understanding--generation trade-off across all methods (Janus-Pro-7B, mean over 3 seeds, $n{=}200$ per seed). Methods cluster in a narrow CLIPScore band, confirming no significant generation improvement. The primary variation is along the understanding axis, where understanding-only DPO and magnitude-balancing methods achieve the largest gains.
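The two diagnostics in the captions above (gradient cosine similarity and magnitude imbalance) and the idea of magnitude balancing can be sketched on toy vectors. This is a hedged illustration using only the standard library; the helper names and the inverse-magnitude weighting rule are assumptions made for the example, not the paper's confirmed implementation. In practice the inputs would be the flattened per-task gradients of the shared parameters.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def gradient_diagnostics(g_u, g_g):
    """Return (cosine similarity, ||g_G|| / ||g_U||) for two flattened gradients."""
    n_u, n_g = norm(g_u), norm(g_g)
    cosine = dot(g_u, g_g) / (n_u * n_g)
    return cosine, n_g / n_u

def inverse_magnitude_weights(n_u, n_g):
    """Hypothetical balancing rule: choose weights summing to 1 such that
    w_U * ||g_U|| == w_G * ||g_G||, so neither task dominates the update."""
    total = n_u + n_g
    return n_g / total, n_u / total

# Toy stand-ins: orthogonal gradients, with the generation gradient 14x larger.
g_u = [1.0, 0.0]
g_g = [0.0, 14.0]

cosine, ratio = gradient_diagnostics(g_u, g_g)
w_u, w_g = inverse_magnitude_weights(norm(g_u), norm(g_g))
print(cosine, ratio, round(w_u, 3))  # 0.0 14.0 0.933
```

Under this assumed rule, a 14x imbalance gives an understanding weight of 14/15, about 0.93. That is in line with the stabilized value of $w_U$ reported in the Figure 3 caption, although the agreement alone does not establish which rule the authors used.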