Understanding Visual Concepts Across Models
Brandon Trabucco, Max Gurinas, Kyle Doherty, Ruslan Salakhutdinov
TL;DR
The paper asks whether word embeddings that encode new visual concepts, learned via soft-prompt tuning, transfer across distinct large multimodal models. It presents a large-scale study of three tasks (text-to-image generation, open-set object detection, and zero-shot classification) covering four datasets and 4,800 embeddings trained for 40 concepts, and it frames cross-model transfer as a linear transfer function $T^{y \to x}$ from one model's embedding space to another's. The key finding is that embeddings are largely model-specific and non-transferable. The embedding space is fractured: a perturbation within an $\epsilon$-ball of an existing embedding can elicit an arbitrary target concept, but transferred embeddings seldom preserve in-domain performance. The work highlights fundamental interoperability challenges in multimodal systems: reusing prompts across models is not generally feasible, and it suggests that awareness of model-specific geometry and of late-layer focus in text encoders is crucial for robust cross-task concept encoding. The results have practical implications for designing reusable prompts and for evaluating transferability in evolving multimodal ecosystems. The paper also notes safety and ethical considerations around privacy and misuse.
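To make the idea of a linear transfer function $T^{y \to x}$ concrete, here is a minimal sketch that fits a linear map between two embedding spaces by least squares. It assumes the two models share some vocabulary tokens whose embeddings can be paired row by row; that fitting procedure, the synthetic data, and the names `fit_linear_transfer` and `transfer` are illustrative assumptions, not necessarily what the paper does.

```python
import numpy as np


def fit_linear_transfer(E_y, E_x):
    """Least-squares fit of W such that E_y @ W approximates E_x.

    E_y: (n_tokens, d_y) embeddings of paired tokens in source model y.
    E_x: (n_tokens, d_x) embeddings of the same tokens in target model x.
    Assumption: rows are aligned, i.e. row i is the same token in both.
    """
    W, *_ = np.linalg.lstsq(E_y, E_x, rcond=None)
    return W


def transfer(W, e_y):
    """Map one embedding from model y's space into model x's space."""
    return e_y @ W


# Synthetic stand-in data (hypothetical sizes, not the real models).
rng = np.random.default_rng(0)
n_shared, d_y, d_x = 50, 8, 6
E_y = rng.normal(size=(n_shared, d_y))
W_true = rng.normal(size=(d_y, d_x))
E_x = E_y @ W_true  # noiseless, so an exact linear map exists here

W = fit_linear_transfer(E_y, E_x)

# Transfer a "newly learned" concept embedding from y to x.
new_concept_y = rng.normal(size=d_y)
new_concept_x = transfer(W, new_concept_y)
```

In this toy setting an exact linear map exists by construction, so the fit recovers it. The paper's finding is that learned concept embeddings in real models generally do not survive such a transfer.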
Abstract
Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
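The constraint "within an $\epsilon$-ball of a prior embedding" can be illustrated with a projection step of the kind used in projected gradient methods. The sketch below assumes an L2 ball and a hypothetical helper `project_to_ball`; the abstract does not state which norm or optimizer the authors use, so treat this only as an illustration of the geometric constraint.

```python
import numpy as np


def project_to_ball(candidate, anchor, eps):
    """Project `candidate` onto the L2 ball of radius `eps` around `anchor`.

    Assumption: the ball is measured in the L2 norm; the source does not
    specify the norm.
    """
    delta = candidate - anchor
    norm = np.linalg.norm(delta)
    if norm <= eps:
        return candidate.copy()  # already inside the ball
    return anchor + delta * (eps / norm)  # rescale onto the boundary


# Toy example: the anchor stands in for an existing word embedding.
anchor = np.zeros(4)
candidate = np.array([3.0, 4.0, 0.0, 0.0])  # distance 5 from the anchor
projected = project_to_ball(candidate, anchor, eps=1.0)
```

In a tuning loop, a projection like this would follow each gradient update so that the learned embedding never moves more than `eps` away from the embedding it started from.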
