Understanding Visual Concepts Across Models

Brandon Trabucco, Max Gurinas, Kyle Doherty, Ruslan Salakhutdinov

TL;DR

The paper asks whether word embeddings that encode new visual concepts, learned via soft-prompt tuning, transfer across distinct large multimodal models. It presents a large-scale study across three tasks (text-to-image generation, open-set object detection, and zero-shot classification) using four datasets and 4,800 embeddings for 40 concepts, and frames cross-model transfer through a linear transfer function $T^{y \to x}$ between embedding spaces. The key finding is that embeddings are largely model-specific and non-transferable, and that the embedding space is fractured: perturbations within an $\epsilon$-ball of any prior embedding can elicit an arbitrary target concept, but these solutions seldom preserve their in-domain performance after transfer. The work highlights fundamental interoperability challenges in multimodal systems, showing that reusing learned embeddings across models is not generally feasible and suggesting that awareness of model-specific geometry and late-layer focus in text encoders is important for robust cross-task concept encoding. The results have practical implications for designing reusable prompts and for evaluating transferability as multimodal models evolve, and the paper also notes safety and ethical considerations around privacy and misuse.

Abstract

Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
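
To make the $\epsilon$-ball constraint mentioned in the abstract concrete, the following minimal sketch projects a candidate embedding back onto a ball of radius `epsilon` around an existing (anchor) embedding. The helper name `project_to_epsilon_ball`, the choice of the L2 norm, and the toy vectors are assumptions for illustration only; the authors' code may constrain perturbations differently.

```python
import math

def project_to_epsilon_ball(candidate, anchor, epsilon):
    """Hypothetical helper: return the point nearest to `candidate` that lies
    within an L2 ball of radius `epsilon` centered on `anchor`."""
    delta = [c - a for c, a in zip(candidate, anchor)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon:
        # Already inside the ball: leave the candidate unchanged.
        return list(candidate)
    # Outside the ball: shrink the offset so its length equals epsilon.
    scale = epsilon / norm
    return [a + scale * d for a, d in zip(anchor, delta)]

# Toy 3-dimensional example (real word embeddings have hundreds of dimensions).
anchor = [1.0, 0.0, 0.0]       # embedding of an unrelated existing word
candidate = [1.0, 3.0, 4.0]    # embedding after an unconstrained update
projected = project_to_epsilon_ball(candidate, anchor, epsilon=0.5)
```

In a soft-prompt-tuning loop, a projection like this would typically be applied after each gradient step so that the learned word stays in the immediate neighborhood of the anchor word.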

Paper Structure (31 sections, 4 equations, 18 figures, 2 tables)


Figures (18)

  • Figure 1: Large multimodal models can learn new words that represent specific concepts, like <black-dog> for the black Labrador retriever on the left in the figure. Do models learn similar words for the same concept? We study the interoperability of new word embeddings that encode visual concepts across three models and tasks, and show that popular soft prompt-tuning approaches find model-specific and non-transferable solutions.
  • Figure 2: Transferring words optimized for generation to detection tasks. We fine-tune the vector embeddings for new words (such as <orange-cat> for the orange cat in the figure) to minimize a noise prediction loss for generation. Vector embeddings are transferred from generation to detection using the Transfer Function $T(\vec{v})$, and used to produce zero-shot instance detections for the target visual concept (in this case, orange cats).
  • Figure 3: Visual word embeddings trained for one task (e.g. generation) perform well on that task, but may not perform well when transferred to another task (e.g. generation $\rightarrow$ detection). In certain directions, such as classification $\rightarrow$ generation, transfer works better than in others. To understand when transfer fails, we perform extensive ablations across four standard datasets and three models in generation, detection, and classification.
  • Figure 4: Generations (rows 2-4) from Stable Diffusion for target concepts (top row) from the DreamBooth and PASCAL datasets. The second row trains word embeddings for generation. The third row transfers word embeddings from classification to generation. The final row transfers from detection. Words trained for generation capture fine-grained details. Words trained for classification work for common concepts on PASCAL, but fail at fine-grained concepts on DreamBooth. Words trained for detection generally don't transfer.
  • Figure 5: Example generations and detections for various concepts (row labels) using solutions found in the immediate neighborhood of unrelated words (column labels). We consistently find new words for generating and detecting arbitrary concepts near unrelated anchor words across ImageNet (examples in the appendix), DreamBooth (first three rows), COCO (last three rows), and PASCAL VOC (examples in the appendix) datasets. The same objects are detected, and in several cases, near-identical images are generated.
  • ...and 13 more figures
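
The Transfer Function in Figure 2 maps a word embedding from one model's embedding space into another's, and the summary above describes it as linear. This excerpt does not say how the map is obtained, so the sketch below makes an assumption: it fits the map by least squares on paired embeddings of words that both models already share, using synthetic data. The function names `fit_linear_transfer` and `transfer`, the dimensions, and the fitting procedure are all hypothetical and are not the authors' method.

```python
import numpy as np

def fit_linear_transfer(source_vecs, target_vecs):
    """Hypothetical helper: least-squares matrix W such that
    source_vecs @ W approximates target_vecs."""
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

def transfer(vec, W):
    """Map one source-space embedding into the target space."""
    return vec @ W

# Synthetic stand-in for the embeddings of 50 words known to both models.
rng = np.random.default_rng(0)
source_vocab = rng.normal(size=(50, 8))   # source model, embedding dim 8
true_map = rng.normal(size=(8, 6))        # ground-truth map for this toy data
target_vocab = source_vocab @ true_map    # target model, embedding dim 6

W = fit_linear_transfer(source_vocab, target_vocab)

# A newly learned word embedding in the source model, moved to the target model.
new_word = rng.normal(size=8)
transferred = transfer(new_word, W)
```

On this noiseless toy data the fit recovers the generating map exactly. That says nothing about real models, where the paper reports that transferred embeddings seldom retain their in-domain performance.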