An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning

Chen Jin; Ryutaro Tanno; Amrutha Saseendran; Tom Diethe; Philip Teare

TL;DR

This work introduces Multi-Concept Prompt Learning (MCPL), a mask-free approach to discover and learn multiple object-level concepts from a single sentence-image pair by updating only textual embeddings. It builds on Textual Inversion and leverages cross-attention in frozen diffusion models, augmented with three regularisers—AttnMask, PromptCL, and Bind adj—to achieve focused, disentangled concept representations and accurate word-region correlations. The authors provide a new multi-concept dataset (25 concepts, 1,000 sentence-image pairs) and demonstrate robust performance across natural and biomedical images, including qualitative editing capabilities and user studies, while highlighting both the storage efficiency and limitations in complex scenes. Overall, MCPL enables mask-free local editing and hypothesis generation by language-driven concept discovery, offering a scalable, low-storage pathway for learning unseen concepts from text descriptions. The work suggests a practical route for knowledge discovery in scientific and medical domains where annotations are scarce or unavailable.

Abstract

Textual Inversion, a prompt learning method, learns a single text embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying multiple unknown object-level concepts within one scene remains a complex challenge. While recent methods have resorted to cropping or masking individual images to learn multiple concepts, these techniques often require prior knowledge of new concepts and are labour-intensive. To address this challenge, we introduce Multi-Concept Prompt Learning (MCPL), where multiple unknown "words" are simultaneously learned from a single sentence-image pair, without any image annotations. To enhance the accuracy of word-concept correlation and refine attention mask boundaries, we propose three regularisation techniques: Attention Masking, Prompts Contrastive Loss, and Bind Adjective. Extensive quantitative comparisons with both real-world categories and biomedical images demonstrate that our method can learn new semantically disentangled concepts. Our approach learns solely through textual embeddings, using less than 10% of the storage space required by other methods. The project page, code, and data are available at https://astrazeneca.github.io/mcpl.github.io.
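To make the Attention Masking idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the cross-attention maps of the learnable prompts are averaged, min-max normalised, and thresholded into a binary mask that then restricts a reconstruction loss to the attended region. The function names `attention_mask` and `masked_mse`, the normalisation step, and the threshold value of 0.5 are all illustrative assumptions.

```python
def attention_mask(attn_maps, threshold=0.5):
    """Build a binary mask from the average of several attention maps.

    attn_maps: list of 2D lists (one per learnable prompt), all the same shape.
    threshold: hypothetical cut-off applied after min-max normalisation.
    """
    n = len(attn_maps)
    rows, cols = len(attn_maps[0]), len(attn_maps[0][0])
    # Average the maps element-wise across prompts.
    avg = [[sum(m[r][c] for m in attn_maps) / n for c in range(cols)]
           for r in range(rows)]
    # Min-max normalise so the threshold is comparable across images.
    lo = min(min(row) for row in avg)
    hi = max(max(row) for row in avg)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1 if (v - lo) / span >= threshold else 0 for v in row]
            for row in avg]


def masked_mse(pred, target, mask):
    """Mean squared error computed only where mask == 1."""
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            total += m * (p - t) ** 2
            count += m
    return total / count if count else 0.0


# Toy 2x2 attention maps for two learnable prompts.
attn_a = [[0.9, 0.1], [0.0, 0.0]]
attn_b = [[0.1, 0.1], [0.0, 0.9]]
mask = attention_mask([attn_a, attn_b])

pred = [[1.0, 5.0], [5.0, 3.0]]
target = [[0.0, 0.0], [0.0, 1.0]]
loss = masked_mse(pred, target, mask)
```

In this toy example, errors at positions that neither prompt attends to do not contribute to the loss, which is the intuition behind focusing prompt learning on the relevant regions.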

Paper Structure (54 sections, 7 equations, 47 figures, 2 tables, 4 algorithms)

Introduction
Related Works
Language-driven vision concept discovery.
Prompt learning for Diffusion Model.
Multiple concept learning and composing.
Methods
Preliminaries
Motivational study
Multi-Concept Prompt Learning (MCPL)
Training strategies and preliminary results.
Limitations of plain MCPL.
Regularising the multi-concept prompts learning
Encouraging focused prompt-concept correlation with Attention Masking (AttnMask).
Encouraging semantically disentangled multi-concepts with Prompts Contrastive Loss (PromptCL).
Enhance prompt-concept correlation by binding learnable prompt with the adjective word (Bind adj.).
...and 39 more sections

Figures (47)

Figure 1: Language driven multi-concepts learning and applications. Custom Diffusion (CD) and Cones learn concepts from crops of objects, while Break-A-Scene uses masks. In contrast, our method learns object-level concepts using image-sentence pairs, aligning the cross-attention of each learnable prompt with a semantically meaningful region, and enabling mask-free local editing. The project page, code, and data are available at https://astrazeneca.github.io/mcpl.github.io.
Figure 2: Motivational study and preliminary MCPL results. We use Textual Inversion (T.I.) to learn concepts from both masked (left-first) or cropped (left-second) images; MCPL-one, learning both concepts jointly from the full image with a single string; and MCPL-diverse accounting for per-image specific relationships.
Figure 3: Method overview. MCPL takes a sentence (top-left) and a sample image $x_0$ (top-right) as input, feeding them into a pre-trained text-guided diffusion model comprising a text encoder $c_\phi$ and a denoising network $\epsilon_\theta$. The string's multiple prompts are encoded into a sequence of embeddings which guide the network to generate images $\tilde{x}_0$ close to the target one $x_0$. MCPL focuses on learning multiple learnable prompts (coloured texts), updating only the embeddings $v^*$ and $v^\&$ of the learnable prompts while keeping $c_\phi$ and $\epsilon_\theta$ frozen. We introduce Prompts Contrastive Loss (PromptCL) to help separate multiple concepts within learnable embeddings. We also apply Attention Masking (AttnMask), using masks based on the average cross-attention of prompts, to refine prompt learning on images. Optionally we associate each learnable prompt with an adjective (e.g., "brown") to improve control over each learned concept, referred to as Bind adj.
Figure 4: Enhancing object-level prompt-concept correlation in MCPL using the proposed regularisations: AttnMask, PromptCL and Bind adj. We compare MCPL-one with all regularisation terms applied against plain MCPL-one, using a "Ball and Box" example. We use the average cross-attention maps and the AttnMask to assess the accuracy of correlation. Full ablation results are in the appendix.
Figure 5: The t-SNE projection of the learned embeddings. Our method can effectively distinguish all learned concepts (about 10 embeddings per concept) compared to Textual Inversion (MCPL-all), the SoTA mask-based learning method, Break-A-Scene, and the masked "ground truth" (see full results in the appendix).
...and 42 more figures
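
The captions for Figures 3 and 5 describe PromptCL as separating multiple concepts in the space of learnable embeddings. The sketch below illustrates that idea with an InfoNCE-style contrastive loss over cosine similarities, a common choice for this kind of objective; the paper's exact formulation may differ. The function names, the temperature of 0.2, and the toy 2D embeddings are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, labels, temperature=0.2):
    """InfoNCE-style loss: pull same-label embeddings together, push others apart.

    embeddings: list of vectors, one per learned prompt instance.
    labels: concept id for each embedding (same id = same concept).
    temperature: hypothetical scaling factor for the similarities.
    """
    losses = []
    n = len(embeddings)
    for i in range(n):
        # Exponentiated similarity of anchor i to every other embedding.
        sims = {j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
                for j in range(n) if j != i}
        denom = sum(sims.values())
        for j, s in sims.items():
            if labels[j] == labels[i]:  # j is a positive for anchor i
                losses.append(-math.log(s / denom))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
# Same-concept embeddings point in similar directions.
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same-concept embeddings are orthogonal; different concepts overlap.
entangled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = contrastive_loss(separated, labels)
loss_entangled = contrastive_loss(entangled, labels)
```

Under these assumptions the loss is lower when embeddings of the same concept cluster together, so minimising it encourages the disentangled clusters that the t-SNE projection in Figure 5 visualises.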
