A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, Zechao Li

TL;DR

This survey addresses the controllability gap in text-to-image diffusion models by organizing Visual Concept Mining (VCM) into four operational modes: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. It structures methods into tuning-based versus tuning-free approaches, active versus passive erasure, and token-, embedding-, or feature-based decomposition, with emphasis on their fidelity, speed, and generalization. The paper catalogs representative techniques (e.g., DreamBooth, Textual Inversion, IP-Adapter, MACE, AdvDM, Break-A-Scene, Hi-CoDe) and analyzes how the fusion of textual and visual signals underpins each method, while highlighting safety, scalability, and robustness challenges. By outlining a unified taxonomy and identifying open problems, the work aims to guide future research toward more flexible, robust, and scalable VCM solutions for personalized and controllable diffusion-based generation in practical applications.

Abstract

Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

Paper Structure

This paper contains 23 sections, 5 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Visualization of several Visual Concept Mining tasks in text-to-image diffusion models.