OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization

Ye Wang, Zili Yi, Rui Ma

TL;DR

The paper addresses the challenge of capturing fine-grained visual attributes from a single reference image in text-to-image personalization. It proposes OSTAF, a hypernetwork-guided, one-shot fine-tuning framework that modulates the attention weights of the diffusion U-Net (encoder or decoder) to learn attribute-specific features such as appearance, shape, and style. A lightweight hypernetwork predicts the weight offsets, and a parameter $\lambda$ controls their intensity. On an Attribute Benchmark, quantitative and qualitative evaluations show better attribute identification and customization quality than the DreamBooth, ProSpect, IP-Adapter, and ControlNet baselines, while maintaining text controllability and reasonable efficiency. The method enables precise, efficient attribute-focused personalization from a single image; faster tuning and video content are noted as possible extensions for future work.
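The weight-offset idea in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a two-layer hypernetwork that maps a conditioning embedding to a full offset $\Delta W$ for a single frozen projection matrix $W$, applied as $W + \lambda\,\Delta W$. The function names, layer sizes, and the choice of a dense (rather than low-rank) offset are all hypothetical.

```python
import numpy as np


def make_hypernetwork(d_embed, d_hidden, d_out, d_in, seed=0):
    """Build a hypothetical two-layer hypernetwork (illustrative sizes only)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.02, size=(d_embed, d_hidden)),
        "w2": rng.normal(0.0, 0.02, size=(d_hidden, d_out * d_in)),
        "out_shape": (d_out, d_in),
    }


def predict_offset(hyper, embedding):
    """Map a conditioning embedding of shape (d_embed,) to an offset of shape (d_out, d_in)."""
    hidden = np.tanh(embedding @ hyper["w1"])
    return (hidden @ hyper["w2"]).reshape(hyper["out_shape"])


def modulated_projection(x, w_frozen, delta_w, lam):
    """Apply a frozen projection plus the predicted offset scaled by the intensity lam."""
    return x @ (w_frozen + lam * delta_w).T


# Assumed toy setup: 8-dimensional features, 4-dimensional conditioning embedding.
hyper = make_hypernetwork(d_embed=4, d_hidden=16, d_out=8, d_in=8)
w_frozen = np.eye(8)                    # stands in for a pre-trained attention projection
delta_w = predict_offset(hyper, np.ones(4))
features = np.ones((2, 8))              # two tokens of input features
output = modulated_projection(features, w_frozen, delta_w, lam=0.5)
```

In this sketch, setting `lam=0` recovers the frozen model's output exactly, and larger values move the projection further toward the offset, which is one simple way to read the "controllable intensity" described above.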

Abstract

Personalized text-to-image (T2I) models not only produce lifelike and varied visuals but also allow users to tailor the images to fit their personal taste. These personalization techniques can grasp the essence of a concept through a collection of images, or adjust a pre-trained text-to-image model with a specific image input for subject-driven or attribute-aware guidance. Yet, accurately capturing the distinct visual attributes of an individual image poses a challenge for these methods. To address this issue, we introduce OSTAF, a novel parameter-efficient one-shot fine-tuning method that uses only one reference image for T2I personalization. A novel hypernetwork-powered, attribute-focused fine-tuning mechanism is employed to precisely learn various attribute features (e.g., appearance, shape, or drawing style) from the reference image. Compared with existing image customization methods, our method shows significant superiority in attribute identification and application, and achieves a good balance between efficiency and output quality.

Paper Structure

This paper contains 16 sections, 3 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: Attribute-focused text-to-image personalization. Our method allows for the generation of customized appearance, shape and style attributes using only one reference image, as shown by the dashed frame.
  • Figure 2: (Left) Illustration showcasing the unique roles of the encoder and decoder within the diffusion U-Net in learning varying attributes. (Right) A display of the outcomes achieved without tuning the hypernetwork, suggesting its effectiveness is limited to simple iconic reference images.
  • Figure 3: OSTAF pipeline. Our method requires only one reference image as input, and we introduce a hypernetwork-driven fine-tuning approach to adjust the parameters of the U-Net encoder or decoder for efficient attribute-focused T2I customization.
  • Figure 4: The architecture of the hypernetwork.
  • Figure 5: The comparison of generation results between IP-Adapter and our method. Sub-figure (a) showcases the comparison of appearance customization results, while sub-figures (b) and (c) present the comparison of shape customization results.
  • ...and 11 more figures