InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

Shufan Li, Harkanwar Singh, Aditya Grover

TL;DR

InstructAny2Pix introduces a flexible framework for editing images using multi-modal prompts that can interleave text, images, and audio. It combines a multi-modal encoder, a multi-modal LLM, and a diffusion-based decoder with a refinement module to generate high-fidelity edits, supported by MM-Inst and Dreambooth++ benchmarks. The approach demonstrates strong performance on complex multi-object edits and multi-modal tasks, including music-guided design, while maintaining competitive zero-shot results on traditional text-based edits. The work enables versatile applications in design and live visuals, though it notes biases and stylistic tendencies as important limitations to address in future work.

Abstract

The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to facilitate training efficiency and improve generation quality, we include an additional refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git
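The data flow described in the abstract (encode, reason, refine, decode) can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the authors' implementation: every name below (`Media`, `encode_media`, `llm_condition`, `refine`, `decode`, `edit`, `DIM`) is hypothetical, and each learned component is replaced by a trivial deterministic stand-in so that the control flow is runnable.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union
import hashlib
import math

DIM = 8  # toy latent size (assumption; the real dimension is not stated here)

Embedding = List[float]


@dataclass
class Media:
    """A non-text input, identified here only by a kind and a name."""
    kind: str  # "image" or "audio"
    name: str


def _toy_vector(key: str) -> Embedding:
    """Deterministic stand-in for a learned embedding (not a real encoder)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def encode_media(item: Media) -> Embedding:
    """Step 1: map images and audio into one shared latent space."""
    return _toy_vector(f"{item.kind}:{item.name}")


def llm_condition(instruction: Sequence[Union[str, Media]]) -> Embedding:
    """Step 2: read interleaved text and media, emit one conditional embedding.

    Toy stand-in for the multi-modal LLM: average the per-item vectors.
    """
    if not instruction:
        raise ValueError("instruction must contain at least one item")
    vectors = [
        encode_media(part) if isinstance(part, Media) else _toy_vector("text:" + part)
        for part in instruction
    ]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def refine(embedding: Embedding) -> Embedding:
    """Step 3: stand-in for the refinement prior: L2-normalise the embedding."""
    norm = math.sqrt(sum(x * x for x in embedding)) or 1.0
    return [x / norm for x in embedding]


def decode(embedding: Embedding, height: int = 2, width: int = 2) -> List[List[float]]:
    """Step 4: stand-in for the diffusion decoder: tile the embedding into a grid."""
    flat = [embedding[i % DIM] for i in range(height * width)]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def edit(instruction: Sequence[Union[str, Media]]) -> List[List[float]]:
    """Run the full chain: LLM conditioning, refinement, then decoding."""
    return decode(refine(llm_condition(instruction)))
```

In this sketch the image to be edited is simply one of the `Media` items in the instruction, for example `edit(["add", Media("image", "dog.png"), "to", Media("image", "park.png"), "matching", Media("audio", "song.wav")])`. The point is the ordering of the modules: the decoder only ever sees the refined embedding, never the raw instruction.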

Paper Structure (57 sections, 2 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Illustration of InstructAny2Pix's ability to flexibly edit an image based on a variety of multi-modal instructions. More examples of audio-guided editing are provided in the supplementary demo video.
  • Figure 2: The InstructAny2Pix pipeline consists of three building blocks: a multi-modal encoder that "perceives" audiovisual inputs, a large language model that "reasons" about the edit instructions, and a diffusion model that "draws" the edited results. For improved training and generation, we include an additional refinement module to refine the LLM outputs.
  • Figure 3: The training pipeline of InstructAny2Pix consists of four steps: 1. Pretraining of the multi-modal LLM with text-to-x and x-to-image tasks. 2. Pretraining of the diffusion decoder. 3. Pretraining of the refinement module. 4. Instruction finetuning.
  • Figure 4: Music-Guided Image Variation: Music uniquely conveys emotions that are hard to describe using other modalities such as language. We show qualitative results of music-guided image variation and music-inspired design. InstructAny2Pix is able to understand a diverse set of emotions embedded in music and generate creative designs and edits. We include these examples with audio in our supplementary video.
  • Figure 5: Editing with Multi-Object Instructions. Compared with a previous text-based method (MGIE) and an image-based method (Kosmos-G), InstructAny2Pix uniquely accomplishes complex editing tasks.
  • ...and 14 more figures