IntentTuner: An Interactive Framework for Integrating Human Intents in Fine-tuning Text-to-Image Generative Models

Xingchen Zeng, Ziyao Gao, Yilin Ye, Wei Zeng

TL;DR

IntentTuner presents a multi-modal, interactive framework that embeds human intents into the fine-tuning of text-to-image models. By translating natural language and visual exemplars into structured intent specifications, it guides data augmentation, caption optimization, and intent-aware evaluation, unifying fine-tuning with generation in a single interface. Through a formative study, application scenarios, and a user study, the approach demonstrates improved alignment with user goals and reduced cognitive load relative to a baseline tool. The work advances controllability and accessibility in personalized AIGC, with implications for ethical data use and cross-domain creative workflows.

Abstract

Fine-tuning facilitates the adaptation of text-to-image generative models to novel concepts (e.g., styles and portraits), empowering users to forge creatively customized content. Recent efforts on fine-tuning focus on reducing training data and lightening computation overload but neglect alignment with user intentions, particularly in manual curation of multi-modal training data and intent-oriented evaluation. Informed by a formative study with fine-tuning practitioners for comprehending user intentions, we propose IntentTuner, an interactive framework that intelligently incorporates human intentions throughout each phase of the fine-tuning workflow. IntentTuner enables users to articulate training intentions with imagery exemplars and textual descriptions, automatically converting them into effective data augmentation strategies. Furthermore, IntentTuner introduces novel metrics to measure user intent alignment, allowing intent-aware monitoring and evaluation of model training. Application exemplars and user studies demonstrate that IntentTuner streamlines fine-tuning, reducing cognitive effort and yielding superior models compared to the common baseline tool.

Paper Structure (36 sections, 2 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Comparison of the general fine-tuning pipeline and our intent-aligned fine-tuning framework. (a) General pipeline. Most users rely on a trial-and-error process to check whether the system properly understands their intents, where they manually preprocess the training images, such as cropping, categorizing, and tagging, and observe the generated images. (b) Our pipeline. IntentTuner allows users to efficiently articulate their intents to automatically steer important milestones of the fine-tuning, including data augmentation, training monitoring, and evaluation.
  • Figure 2: General workflow. The input Raw Image Set is enhanced during the Data Pre-processing phase by cropping, categorizing, and tagging according to the intended requirements, producing images and captions that align with the intention. Next, using the processed image-caption pairs, the Model Training begins. Users can monitor the progress of model training to determine whether to continue or stop. Finally, users generate images to conduct Model Evaluation. Users manually input various prompts to test the model's performance from perspectives related to the intention. After manual inspection, the optimal model is selected.
  • Figure 3: User intention. We summarize user intentions in Domain, Concept, and Operation. Domain refers to the specialized area of creation, such as "2D character". Concept defines the specific elements in the intentions, including three different granularities: Attribute (e.g., "hair color"), Instance (e.g., "face", "costume") and Imagery (e.g., "lighting", "blurry background"). Operation encompasses three different types of intended manipulations on specific concepts, including Keep, Modify, and Delete. For example, P1 wants to keep the costume, modify the hairstyle, and delete the watermarks.
  • Figure 4: Language-vision intent input and transformation. We allow users to provide detailed multi-modal input to clarify their intents, including the description text and reference images. Powered by the language model, the user input will be transformed into intent specifications, including trigger words, domain, concepts, and operations.
  • Figure 5: Image augmentation. Based on the intent specifications shown in Fig. 4, we introduce a language-vision intent filter to transfer users' precise intentions to achieve intent-guided data augmentation. Specifically, the fine-grained concepts are passed to a cross-modal Detection module, which can disambiguate the intended concepts and locate the corresponding visual concepts. Then, in the Filter module, users can accurately retrieve samples with the specified concepts with the help of the reference images. Finally, the concept-aligned samples are augmented based on different intended operations to provide more intent-aligned fine-tuning data.
  • ...and 5 more figures
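
The intent structure described in the captions of Figures 3 and 4 (trigger words, a domain, and concepts at three granularities, each paired with a Keep, Modify, or Delete operation) can be sketched as a small data structure. This is a minimal illustration only, not the authors' implementation: the class names, field names, and the `concepts_for` helper are all hypothetical, as are the trigger word, the domain assigned to P1, and the granularities assigned to "hairstyle" and "watermark".

```python
from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    """Concept granularities named in the Figure 3 caption."""
    ATTRIBUTE = "attribute"  # e.g. "hair color"
    INSTANCE = "instance"    # e.g. "face", "costume"
    IMAGERY = "imagery"      # e.g. "lighting", "blurry background"


class Operation(Enum):
    """Intended manipulations named in the Figure 3 caption."""
    KEEP = "keep"
    MODIFY = "modify"
    DELETE = "delete"


@dataclass(frozen=True)
class ConceptIntent:
    """One concept paired with the operation the user intends for it."""
    name: str
    granularity: Granularity
    operation: Operation


@dataclass
class IntentSpec:
    """Hypothetical container for the fields listed in the Figure 4 caption."""
    domain: str
    trigger_words: list[str] = field(default_factory=list)
    concepts: list[ConceptIntent] = field(default_factory=list)

    def concepts_for(self, operation: Operation) -> list[str]:
        """Return the names of concepts tagged with the given operation."""
        return [c.name for c in self.concepts if c.operation is operation]


# P1's example from the Figure 3 caption: keep the costume, modify the
# hairstyle, delete the watermarks. The domain, trigger word, and the
# granularities of "hairstyle" and "watermark" are assumptions.
p1_intent = IntentSpec(
    domain="2D character",
    trigger_words=["p1_character"],
    concepts=[
        ConceptIntent("costume", Granularity.INSTANCE, Operation.KEEP),
        ConceptIntent("hairstyle", Granularity.ATTRIBUTE, Operation.MODIFY),
        ConceptIntent("watermark", Granularity.IMAGERY, Operation.DELETE),
    ],
)
```

Grouping concepts by operation, as `concepts_for` does, mirrors the idea in the Figure 5 caption that concept-aligned samples are augmented differently depending on the intended operation.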