Generative Visual Instruction Tuning
Jefferson Hernandez, Ruben Villegas, Vicente Ordonez
TL;DR
GenLLaVA addresses the challenge of building a single large multimodal model that handles image understanding, generation, and editing without sacrificing performance on any of them. It does so through a single-stage instruction-tuning pipeline that combines a vision encoder (SigLIP), a language model (Mistral-7B), and a diffusion-based generation head (Stable Diffusion), coordinated through task tokens and trained on GVIT instruction data curated with GPT-4V. The paper evaluates GenLLaVA across visual understanding and generation benchmarks, where it surpasses prior large multimodal models such as LLaVA on visual understanding and is competitive with Unified-IO 2, while remaining open-source. The work shows that existing multimodal components can be reused to build versatile, general-purpose visual assistants, and it lays the groundwork for extending these capabilities to video and audio-visual tasks.
Abstract
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
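As a rough illustration of how one model could serve understanding, generation, and editing requests, the sketch below routes a prompt to a component based on a leading task token. It is a minimal sketch under stated assumptions, not the authors' implementation: the token strings (`<understand>`, `<generate>`, `<edit>`), the head names, and the functions `split_task_token` and `route` are all hypothetical, and no real model is loaded.

```python
from typing import Dict, Tuple

# Hypothetical task tokens and head names (assumptions for illustration only;
# the source mentions task tokens but does not give their spelling).
TASK_HEADS: Dict[str, str] = {
    "<understand>": "language_model",  # answer in text (e.g. a Mistral-style LLM)
    "<generate>": "diffusion_head",    # text-to-image (e.g. a Stable Diffusion-style head)
    "<edit>": "diffusion_head",        # image editing, conditioned on an input image
}


def split_task_token(prompt: str) -> Tuple[str, str]:
    """Return (task_token, remaining_text); raise if no known token leads the prompt."""
    stripped = prompt.lstrip()
    for token in TASK_HEADS:
        if stripped.startswith(token):
            return token, stripped[len(token):].strip()
    raise ValueError("prompt does not start with a known task token")


def route(prompt: str, has_image: bool = False) -> Dict[str, object]:
    """Decide which component would produce the output for this request."""
    token, text = split_task_token(prompt)
    # Editing only makes sense when there is an image to edit.
    if token == "<edit>" and not has_image:
        raise ValueError("<edit> requires an input image")
    return {
        "task": token,
        "head": TASK_HEADS[token],
        # An input image, if present, would first pass through the vision encoder.
        "uses_vision_encoder": has_image,
        "text": text,
    }


if __name__ == "__main__":
    print(route("<generate> a red bicycle leaning on a brick wall"))
    print(route("<understand> What is the person holding?", has_image=True))
```

In a real system the routing decision would come from tokens emitted or consumed by the language model itself rather than from string matching; the string-based version here only makes the control flow explicit.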
