InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa

TL;DR

InstructCV introduces a unified language interface for computer vision by recasting standard tasks as text-to-image generation guided by natural language instructions. The authors construct a multi-modal, multi-task instruction-tuning dataset from four vision datasets, use an LLM to paraphrase prompt templates into diverse instructions, and encode each task output as an image. The model is trained as a conditional latent diffusion model with instruction conditioning and classifier-free guidance. It yields competitive results and generalizes well to unseen data, categories, and prompts, particularly with paraphrased prompts. Inference is faster than for some generalist models, but the authors acknowledge limitations in real-time performance and in output admissibility, and they suggest richer instructions and human feedback as future directions.

Abstract

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
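The data construction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the templates in `TEMPLATES`, the `paraphrase` stand-in for the LLM rephrasing step, and the specific visual encodings (white-on-black mask, red box drawn on the input, normalized grayscale depth) are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass

import numpy as np

# Hypothetical prompt templates; the paper's actual templates are not reproduced here.
TEMPLATES = {
    "segmentation": "Segment the {category}.",
    "detection": "Detect the {category}.",
    "depth": "Estimate the depth map of this image.",
}


@dataclass
class Example:
    image: np.ndarray   # input image x, shape (H, W, 3), uint8
    instruction: str    # natural-language instruction
    target: np.ndarray  # visually encoded output v(y), shape (H, W, 3), uint8


def paraphrase(template: str, rng: random.Random) -> str:
    """Stand-in for the LLM rephrasing step: picks one canned rewording."""
    body = template[0].lower() + template[1:]
    return rng.choice([template, "Please " + body, "Given the image, " + body])


def encode_mask(mask: np.ndarray) -> np.ndarray:
    """Binary mask (H, W) -> white-on-black RGB image (assumed encoding)."""
    gray = mask.astype(np.uint8) * 255
    return np.repeat(gray[..., None], 3, axis=2)


def encode_box(image: np.ndarray, box, color=(255, 0, 0), width=1) -> np.ndarray:
    """Draw box (x0, y0, x1, y1) on a copy of the input (assumed encoding)."""
    out = image.copy()
    x0, y0, x1, y1 = box
    out[y0:y0 + width, x0:x1] = color  # top edge
    out[y1 - width:y1, x0:x1] = color  # bottom edge
    out[y0:y1, x0:x0 + width] = color  # left edge
    out[y0:y1, x1 - width:x1] = color  # right edge
    return out


def encode_depth(depth: np.ndarray) -> np.ndarray:
    """Depth map (H, W) -> grayscale RGB scaled to [0, 255] (assumed encoding)."""
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    gray = (norm * 255).round().astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=2)


def make_example(task, image, label, category, rng) -> Example:
    """Build one (input image, instruction, output image) training triple."""
    instruction = paraphrase(TEMPLATES[task].format(category=category), rng)
    if task == "segmentation":
        target = encode_mask(label)
    elif task == "detection":
        target = encode_box(image, label)
    elif task == "depth":
        target = encode_depth(label)
    else:
        raise ValueError(f"unknown task: {task}")
    return Example(image, instruction, target)
```

Calling `make_example("segmentation", image, mask, "dog", random.Random(0))` yields one triple. Because every task shares the same image-in, image-out format, a single diffusion model can be finetuned on the pooled set.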

Paper Structure

This paper contains 14 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Application of InstructCV to new test images & user-written instructions: InstructCV performs the vision task described in the instruction on the input image. (Images courtesy of UC Berkeley and MIT).
  • Figure 2: Pictorial depiction of the InstructCV training pipeline. (a) We pool multiple computer vision datasets to construct a multi-modal and multi-task set of image pairs, where the target of each task is visually encoded in the form of an output image. Starting with a set of task-specific prompt templates, we sample a new instruction for each training point by using an LLM to rephrase the template for the corresponding task. (b) Using the dataset in (a), we finetune a diffusion model to produce the output ${\bf v}({\bf y})$ given an image ${\bf x}$ & an instruction $\mathcal{I}$.
  • Figure 3: Impact of classifier-free guidance on the outputs of InstructCV for the depth estimation task.
  • Figure 4: Samples of InstructCV outputs across all vision tasks. The segmentation and detection outputs (top two rows) are obtained by applying one prompt for each category and overlaying the results in one output image.
  • Figure 5: InstructCV raw outputs given new unseen input images and user-written instructions. InstructCV shows compelling generalization to new images, categories and instructions for the semantic segmentation, object detection and depth estimation tasks, but falls short in the image classification task.
  • ...and 5 more figures
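
Figure 3 concerns classifier-free guidance. As a rough sketch of how guidance combines noise predictions at each denoising step, the snippet below uses the two-scale formulation from InstructPix2Pix, the architecture InstructCV follows. Whether InstructCV uses exactly this form, and with which scale values, is an assumption here; the function and argument names are illustrative.

```python
import numpy as np


def guided_noise(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """Combine three noise predictions with image and text guidance scales.

    eps_uncond: prediction with neither image nor instruction conditioning
    eps_img:    prediction conditioned on the input image only
    eps_full:   prediction conditioned on the image and the instruction
    s_img, s_txt: guidance scales (the values to use are an assumption)
    """
    return (
        eps_uncond
        + s_img * (eps_img - eps_uncond)
        + s_txt * (eps_full - eps_img)
    )


# With both scales at 1 the terms telescope to the fully conditioned prediction.
rng = np.random.default_rng(0)
e_u, e_i, e_f = (rng.normal(size=(4, 8, 8)) for _ in range(3))
assert np.allclose(guided_noise(e_u, e_i, e_f, 1.0, 1.0), e_f)
```

Scales above 1 push the sample further toward the conditioning signal, which is the kind of effect Figure 3 illustrates for depth estimation.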