DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui

TL;DR

DEEM tackles the fragility of image perception in large multimodal models by introducing diffusion-model feedback as an extra, self-supervised 'eye' to correct semantic bias in image encoders. The method integrates a diffusion-based image decoder with a vision-language backbone in an end-to-end, interleaved image-text framework, aided by a consistency regularization term. Empirical results on RobustVQA, POPE, and MMVP show notable gains in robustness and reduced visual hallucinations, achieved with smaller encoders and less training data. The work advances multimodal robustness by leveraging generative feedback, and it lays the groundwork for broader, safer multimodal reasoning and creation tasks.

Abstract

The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data; for example, they can hardly distinguish orientation, quantity, color, and structure. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard irrelevant details. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that relied solely on image encoders like CLIP-ViT, thereby enhancing the model's resilience against out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on both our newly constructed RobustVQA benchmark and two other well-known benchmarks, POPE and MMVP, for visual hallucination and perception. In particular, DEEM improves LMMs' visual perception performance to a large extent (e.g., 4% higher on RobustVQA, 6.5% higher on MMVP, and 12.8% higher on POPE). Compared to the state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data (10%), and a smaller base model size.

Paper Structure (32 sections, 5 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Illustration of DEEM. When encountering natural adversarial examples or out-of-distribution data, DEEM uses the diffusion model to check if the semantic features of the image encoder match the input images. This approach allows DEEM to serve as the "eyes" of the large language model, proactively identifying and correcting misinterpreted semantic information during training, thereby avoiding the loss of important visual details. This enhances the robustness, hallucination recognition, and foundational visual perception capabilities of LMMs. In contrast, other models rely too heavily on erroneous inputs from the image encoder, making it difficult for them to handle challenges posed by such data.
  • Figure 2: Overview of our DEEM framework. Interleaved documents serve as input and are decoded to produce outputs. Both text and images are encoded into sequential, discrete token embeddings for the LMM input. Here, we replace the <IMG> token embedding in the text with the image embedding before inputting it into the LLM. The text is predicted in an autoregressive manner, and the images are synthesized by the DM-based image decoder conditioned on holistic historical semantics captured by the LMM. In addition, the image token embeddings are fed into the DM-based image decoder for consistent image restoration. The start-of-image token <SOI> is used to determine the starting position of the image, facilitating the natural autoregressive generation of interleaved text-image layouts. Note that our core architecture is presented without the connectors between modules for simplicity.
  • Figure 3: Pipeline of Mask-Aware Extractor. The mask-aware extractor can be used to extract region-level visual features based on the mask-aware operation. A simple dot product is applied between the mask and the image embedding before being fed into the LLM.
  • Figure 4: Examples from ImageNet-R, ImageNet-A, and ImageNet-V2. These examples share similar backgrounds, rare materials, and unusual textures. They serve as natural adversarial examples and out-of-distribution data, which can be used to test the robustness of models.
  • Figure 5: Zero-shot text-to-image generation FID on MS-COCO and LN-COCO.
  • ...and 10 more figures
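
The captions for Figures 2 and 3 describe two small input-side operations: swapping each <IMG> placeholder for the image's embeddings, and weighting the image embedding by a region mask. The toy sketch below illustrates both with plain Python lists. It is not the authors' implementation: the function names, the lookup-table text embedding, and the reading of the "dot product" as a per-patch multiplication by the mask value are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]

IMG_TOKEN = "<IMG>"  # placeholder token named in the Figure 2 caption


def splice_image_embeddings(
    tokens: Sequence[str],
    text_embed: Dict[str, Vector],          # hypothetical text embedding table
    image_embeds: Sequence[List[Vector]],   # one list of patch embeddings per image
) -> List[Vector]:
    """Build the LLM input sequence, replacing each <IMG> token
    with the patch embeddings of the next image in order."""
    images = iter(image_embeds)
    sequence: List[Vector] = []
    for tok in tokens:
        if tok == IMG_TOKEN:
            sequence.extend(next(images))   # one placeholder -> many rows
        else:
            sequence.append(text_embed[tok])
    return sequence


def mask_aware_features(image_embed: List[Vector], mask: Sequence[float]) -> List[Vector]:
    """Scale each patch embedding by its mask value (assumed interpretation
    of the mask operation in Figure 3); masked-out patches become zero."""
    if len(image_embed) != len(mask):
        raise ValueError("mask must have one value per patch")
    return [[m * x for x in patch] for patch, m in zip(image_embed, mask)]


# Toy example: two text tokens around one image with two patches.
text_embed = {"a": [1.0, 0.0], "cat": [0.0, 1.0]}
image = [[0.5, 0.5], [0.2, 0.8]]

region = mask_aware_features(image, [1.0, 0.0])   # keep only the first patch
seq = splice_image_embeddings(["a", IMG_TOKEN, "cat"], text_embed, [region])
print(len(seq))
```

In this sketch the mask is applied before splicing, matching the caption's statement that the operation happens before the embedding is fed into the LLM.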