An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning
Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
TL;DR
This work introduces Multi-Concept Prompt Learning (MCPL), a mask-free approach that discovers and learns multiple object-level concepts from a single sentence-image pair by updating only textual embeddings. It builds on Textual Inversion and leverages cross-attention in frozen diffusion models, augmented with three regularisers (AttnMask, PromptCL, and Bind adj.) to achieve focused, disentangled concept representations and accurate word-region correlations. The authors provide a new multi-concept dataset (25 concepts, 1,000 sentence-image pairs) and demonstrate robust performance across natural and biomedical images, including qualitative editing capabilities and user studies, while highlighting both the method's storage efficiency and its limitations in complex scenes. Overall, MCPL enables mask-free local editing and hypothesis generation through language-driven concept discovery, offering a scalable, low-storage pathway for learning unseen concepts from text descriptions. The work suggests a practical route for knowledge discovery in scientific and medical domains where annotations are scarce or unavailable.
Abstract
Textual Inversion, a prompt learning method, learns a single text embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying multiple unknown object-level concepts within one scene remains a complex challenge. While recent methods have resorted to cropping or masking individual images to learn multiple concepts, these techniques often require prior knowledge of the new concepts and are labour-intensive. To address this challenge, we introduce Multi-Concept Prompt Learning (MCPL), where multiple unknown "words" are simultaneously learned from a single sentence-image pair, without any image annotations. To enhance the accuracy of word-concept correlation and refine attention mask boundaries, we propose three regularisation techniques: Attention Masking, Prompts Contrastive Loss, and Bind Adjective. Extensive quantitative comparisons on both real-world categories and biomedical images demonstrate that our method can learn new semantically disentangled concepts. Our approach learns solely through textual embeddings, using less than 10% of the storage space required by other methods. The project page, code, and data are available at https://astrazeneca.github.io/mcpl.github.io.
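To make the idea of a contrastive loss over prompt embeddings more concrete, the following is a minimal sketch of a generic InfoNCE-style loss, not the authors' implementation. The function names, the temperature value, the toy two-dimensional embeddings, and the choice to treat a concept word and its adjective as one positive group are all assumptions made for illustration; the abstract only names the technique.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_contrastive_loss(embeddings, groups, temperature=0.2):
    """Generic InfoNCE-style loss over token embeddings (illustrative only).

    Embeddings that share a group id are treated as positives; all other
    embeddings act as negatives. Lower values mean the groups are better
    separated in embedding space.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        # Denominator: similarity of anchor i to every other embedding.
        denom = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n) if k != i
        )
        for j in range(n):
            if i == j or groups[i] != groups[j]:
                continue
            num = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            total += -math.log(num / denom)
            pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy setup: two concepts, each with a noun and an adjective token.
groups = [0, 0, 1, 1]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
entangled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]

loss_separated = prompt_contrastive_loss(separated, groups)
loss_entangled = prompt_contrastive_loss(entangled, groups)
```

In this toy setup the loss is smaller when the tokens of each concept cluster together and apart from the other concept, which is the disentangling behaviour a contrastive regulariser is meant to encourage. In an actual prompt-learning setting the embeddings would be the learnable token vectors, and the loss would be minimised by gradient descent alongside the diffusion objective.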
