Generative Visual Instruction Tuning

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

TL;DR

GenLLaVA tackles the challenge of building a single large multimodal system that handles image understanding, generation, and editing without sacrificing performance on any of the three. It does so through a single-stage instruction-tuning pipeline that fuses a vision encoder (SigLIP), a language model (Mistral-7B), and a diffusion-based generation head (Stable Diffusion), coordinated through task tokens and GVIT data curated with GPT-4V. The paper evaluates the model across visual understanding and generation benchmarks, reporting that GenLLaVA surpasses prior LVMMs such as LLaVA and is competitive with Unified-IO 2, while remaining open-source. The work shows that existing multimodal components can be reused to build general-purpose visual assistants and lays the groundwork for extending these capabilities to video and audio-visual tasks.

Abstract

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
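
The TL;DR mentions that the components are coordinated through task tokens. As a rough illustration only, the sketch below shows one way a leading task token could decide which components take part in a request. The token strings (`[UND]`, `[GEN]`, `[EDIT]`), the `Route` type, and the `route_prompt` helper are all hypothetical and are not taken from the paper or its codebase; the assumption that understanding and editing consume an input image while generation and editing use the diffusion head is likewise an inference from the task descriptions.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    UNDERSTAND = "understand"
    GENERATE = "generate"
    EDIT = "edit"


# Hypothetical token strings: the actual task tokens are not given in this summary.
TASK_TOKENS = {
    "[UND]": Task.UNDERSTAND,
    "[GEN]": Task.GENERATE,
    "[EDIT]": Task.EDIT,
}


@dataclass
class Route:
    task: Task
    needs_input_image: bool     # assumed: the vision encoder is involved
    uses_generation_head: bool  # assumed: the diffusion head is involved
    prompt: str                 # instruction text with the task token removed


def route_prompt(raw: str) -> Route:
    """Strip a leading task token and report which components the task needs."""
    text = raw.strip()
    for token, task in TASK_TOKENS.items():
        if text.startswith(token):
            return Route(
                task=task,
                needs_input_image=task in (Task.UNDERSTAND, Task.EDIT),
                uses_generation_head=task in (Task.GENERATE, Task.EDIT),
                prompt=text[len(token):].strip(),
            )
    raise ValueError(f"no known task token at start of prompt: {raw!r}")
```

For example, `route_prompt("[EDIT] make the sky purple")` yields a route that involves both the input image and the generation head, whereas a `[GEN]` prompt involves only the generation head.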

Paper Structure

This paper contains 32 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison of GenLLaVA against recent architectures. Unlike BLIP-2 (li2023blip), we use a linear projector similar to the LLaVA architecture (liu2023llava). Generation capabilities are added using a diffusion model, but unlike GILL (koh2024generating), we use a Q-former as the generation head. Finally, our model benefits from using a stronger visual encoder, namely SigLIP (zhai2023sigmoid); a stronger LLM, namely Mistral-7B (jiang2023mistral); and a stronger diffuser, namely SDv1.4 (Rombach_2022_CVPR). $^{*}$L stands for linear projection, and Q stands for Q-former resampler.
  • Figure 2: Editing capabilities of our model. GPT-4 currently uses a version of the DALL-E text-to-image model as a tool and, hence, is not directly able to edit images. GPT-4o instead uses tools through Python-generated code to accomplish the requested action. Our model, GenLLaVA, connects input features obtained from CLIP to a language model that also produces output embeddings for a text-to-image StableDiffusion model, achieving an end-to-end editing task with a multimodal model.
  • Figure 3: Qualitative conversational example of our model. The dashed line indicates that the conversation has to be restarted from the beginning due to the model losing track of it.
  • Figure 4: (Left) Results on selected visual question answering datasets. (Right) A qualitative example of our model.
  • Figure 5: Comparisons of VQA capabilities among GenLLaVA, Unified-IO 2, MGIE, and GILL. One can observe that GenLLaVA is able to describe the image in detail and respond to commonly asked questions, even addressing the unusual aspects within an image. Hallucinations made by the models are highlighted in red.
  • ...and 1 more figure
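
The Figure 1 caption distinguishes a linear projector (L) from a Q-former resampler (Q). The minimal sketch below, in pure Python, illustrates the shape-level difference between the two: a linear projection keeps one output token per input token, while a query-based resampler returns a fixed number of tokens regardless of input length. It is a simplification under stated assumptions, not the paper's implementation: the resampler here is a single cross-attention step without learned key/value projections, and all names, dimensions, and random values are hypothetical.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or encoder features (random values)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear_project(features, weight):
    """(N, d_in) times (d_in, d_out) gives (N, d_out): token count N is preserved."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]


def resample(queries, states):
    """K queries attend over M states: (K, d) and (M, d) give (K, d).

    Simplified single cross-attention step; a full Q-former stacks several
    layers with learned projections, which are omitted here.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / math.sqrt(d) for s in states]
        weights = softmax(scores)
        out.append([sum(w * s[j] for w, s in zip(weights, states)) for j in range(d)])
    return out


rng = random.Random(0)
patches = random_matrix(6, 4, rng)             # 6 hypothetical patch features, dim 4
projected = linear_project(patches, random_matrix(4, 8, rng))  # still 6 tokens, dim 8
queries = random_matrix(3, 8, rng)             # 3 hypothetical learned queries
resampled = resample(queries, projected)       # always 3 tokens, dim 8
```

The design trade-off this is meant to show: a linear projector is cheap and passes every token through, whereas a resampler compresses a variable-length sequence into a fixed-size conditioning input, which is convenient when the downstream module expects a fixed number of embeddings.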