IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

Ciara Rowles, Shimon Vainer, Dante De Nigris, Slava Elizarov, Konstantin Kutsy, Simon Donné

TL;DR

IPAdapter-Instruct addresses ambiguity in image-conditioned diffusion by introducing an instruction prompt that guides how a conditioning image should be interpreted. The approach extends IPAdapter with an instruction-attention mechanism, enabling a single model to perform five conditioning tasks (replication, style transfer, composition, object extraction, and identity preservation) while maintaining compatibility with ControlNet and LoRA. Empirical results show that the model performs comparably to or better than task-specific baselines and benefits from randomizing instructions, all with efficient multi-task training. This work advances practical, flexible image-conditioned generation by unifying multiple posteriors under a single, instruction-driven framework with real-world applicability and potential for broader task integration.

Abstract

Diffusion models continuously push the boundary of state-of-the-art image generation, but the process is hard to control with any nuance: practice proves that textual prompts are inadequate for accurately describing image style or fine structural details (such as faces). ControlNet and IPAdapter address this shortcoming by conditioning the generative process on imagery instead, but each individual instance is limited to modeling a single conditional posterior: for practical use-cases, where multiple different posteriors are desired within the same workflow, training and using multiple adapters is cumbersome. We propose IPAdapter-Instruct, which combines natural-image conditioning with "Instruct" prompts to swap between interpretations for the same conditioning image: style transfer, object extraction, both, or something else still? IPAdapter-Instruct efficiently learns multiple tasks with minimal loss in quality compared to dedicated per-task models.

Paper Structure (25 sections, 2 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: By adding a conditioning instruction prompt, IPAdapter-Instruct combines multiple image conditioning methods for easier handling and more efficient training.
  • Figure 1: Quantitative comparison of our IPAdapter-Instruct model to task-specific models and to a model trained and evaluated with hardcoded queries. We find very little quantitative difference, so we prefer the single joint model with flexible instruction queries. Higher is better for all of these metrics.
  • Figure 2: IPAdapter(+) and IPAdapter-Instruct both inject the conditioning in the same way, using an additional cross-attention layer after every text-prompt cross-attention layer. Only the way the condition is condensed into conditioning features differs, as discussed in the architecture section and shown in the overview figure. The original network weights (including the text-prompt cross-attention) are kept frozen and only the new cross-attention is learned.
  • Figure 3: The architecture of IPAdapter+'s condition image projection (orange), and our architecture with the additional attention layers to the instruct prompt (green). All layers within the transformer are residual layers.
  • Figure 4: t-SNE visualization of the CLIP embeddings of 1000 instruction prompts per task in our dataset. Note how well the tasks are delineated, despite their spread. Hard-coded prompts for the ablation study are shown as crosses.
  • ...and 6 more figures
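
The injection scheme described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual wiring, the `scale` parameter, and all function and variable names are assumptions made for illustration. The caption only states that a new cross-attention layer follows each text-prompt cross-attention layer and that only the new layer is trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: each query row attends over the context rows.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def adapter_block(hidden, text_tokens, cond_tokens, frozen, learned, scale=1.0):
    # Frozen text-prompt cross-attention taken from the base model.
    # (Residual connection is an assumption of this sketch.)
    hidden = hidden + cross_attention(hidden, text_tokens, *frozen)
    # Additional cross-attention over the conditioning features; in training,
    # only the `learned` weights would receive gradients.
    # `scale` is a hypothetical knob for the strength of the conditioning.
    hidden = hidden + scale * cross_attention(hidden, cond_tokens, *learned)
    return hidden

# Toy usage with random tensors (shapes are arbitrary choices).
rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(4, d))       # latent image tokens
text_tokens = rng.normal(size=(6, d))  # text-prompt embeddings
cond_tokens = rng.normal(size=(3, d))  # condensed conditioning features
frozen = tuple(rng.normal(size=(d, d)) for _ in range(3))
learned = tuple(rng.normal(size=(d, d)) for _ in range(3))

out = adapter_block(hidden, text_tokens, cond_tokens, frozen, learned)
```

In this sketch, `cond_tokens` stands in for whatever the projection network produces. According to the captions, IPAdapter+ and IPAdapter-Instruct differ only in how those features are computed, so the block above would be the same for both.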