Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space

Zihang Wang, Siyue Zhang, Yilun Zhao, Jingyi Yang, Tingyu Song, Anh Tuan Luu, Chen Zhao

TL;DR

This work systematically evaluates diffusion-based multimodal language models as embedding providers and compares them to autoregressive VLMs across classification, VQA, and retrieval. By fine-tuning with a contrastive objective and using a VLM2Vec pipeline, it shows that diffusion embeddings typically lag behind autoregressive counterparts, with LaViDa being the most competitive yet still inferior on in-domain tasks while showing stronger out-of-domain resilience. The study attributes the performance gap largely to weaker image-text alignment in diffusion VLMs and explores data efficiency and alignment through targeted analyses. These findings suggest that, as of now, diffusion-based embeddings offer limited advantages for multimodal embedding tasks and highlight the need for improved cross-modal alignment or alternative training objectives to close the gap.

Abstract

Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.
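The TL;DR mentions fine-tuning with a contrastive objective, but this summary does not state its exact form. A common choice in VLM2Vec-style pipelines is in-batch InfoNCE, where each query's paired target is the positive and all other targets in the batch act as negatives. The sketch below assumes that formulation; the function names and the temperature value are illustrative, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, targets, temperature=0.05):
    """In-batch InfoNCE loss (assumed formulation, hypothetical temperature).

    targets[i] is the positive for queries[i]; every other target in the
    batch is treated as a negative.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    losses = []
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings: well-aligned pairs versus deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Under this loss, well-aligned query-target pairs give a value near zero and mismatched pairs give a large value. This is why weak image-text alignment, the explanation the abstract offers for the diffusion models' gap, would show up directly in embedding quality.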

Paper Structure (59 sections, 1 equation, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Average performance on three multimodal embedding meta-tasks. Overall, Diffusion VLM embeddings underperform Autoregressive VLM embeddings, despite the use of bidirectional attention; however, performance varies substantially across diffusion models, with LaViDa remaining competitive while MMaDA shows a substantial gap.
  • Figure 2: General architectures of Autoregressive VLMs (left) and Diffusion VLMs (right). For both types of models, the input image is encoded into tokens by a vision encoder (e.g., SigLIP and VQ-GAN) and interleaved with text tokens as input to the language model. Autoregressive VLMs are pretrained using causal attention for next-token prediction, while Diffusion VLMs are pretrained using bidirectional attention for token unmasking.
  • Figure 3: Embedding performance of Autoregressive and Diffusion VLMs under varying fine-tuning data scales. VLM2Vec reflects performance after large-scale training on 662k samples aggregated across all meta-tasks. In contrast, the fine-tuning curves show that the majority of performance gains are achieved with substantially smaller datasets, with diminishing returns as data scale increases. As VLM2Vec is trained in a multi-task setting, it may underperform on certain tasks (e.g., VQA) compared to other models fine-tuned specifically for a single meta-task.
  • Figure 4: t-SNE visualization of query–target embedding pairs on the MSCOCO_i2t dataset for LaViDa and LLaVA-1.6 fine-tuned with different amounts of training data. Circles represent query embeddings and triangles represent target embeddings. Dashed lines connect corresponding query–target pairs, indicating their relative distances in the projected embedding space.
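The causal versus bidirectional attention contrast described in the Figure 2 caption can be illustrated with boolean attention masks. This is a minimal sketch with hypothetical helper names; it shows only the masking pattern, not either model family's actual implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible_counts(mask):
    """Number of positions each token can attend to."""
    return [sum(row) for row in mask]

causal_counts = visible_counts(causal_mask(4))
bidirectional_counts = visible_counts(bidirectional_mask(4))
```

With a causal mask, only the last position sees the full sequence, whereas with a bidirectional mask every position does. The figure captions note that Diffusion VLM embeddings still underperform despite this property.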