FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein

TL;DR

The paper tackles the challenge of controlling fine-grained visual attributes in text-to-image diffusion models by introducing FiVA, a large-scale dataset with a detailed attribute taxonomy (~1M generated images) and a data-generation pipeline that uses range-sensitive filtering and human validation. It then presents FiVA-Adapter, a multimodal adaptation framework that decouples and combines multiple attribute cues using an Attribute-specific Visual Prompt Extractor with a Q-Former and a Multi-image Dual Cross-Attention module to inject attribute signals into diffusion generation. Empirical results show improved subject fidelity and attribute-text alignment, along with the ability to mix attributes from different references, validated through both quantitative metrics (CLIP scores, Attr&Sub accuracy) and qualitative analyses. The work advances practical controllable image generation, enabling user-friendly, fine-grained manipulation of attributes across domains, with potential applications in photography, art, and design.

Abstract

Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed what is, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). The FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.

Paper Structure

This paper contains 29 sections, 2 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Overview. We propose the FiVA dataset and adapter to learn fine-grained visual attributes for better controllable image generation.
  • Figure 2: Examples of visual consistency application range. Some visual attributes, such as 'color' and 'stroke,' are easily transferable across different subjects (left). However, other attributes, like 'lighting' and 'dynamics,' are range-sensitive, meaning they produce varying visual effects depending on the domain (right), resulting in more fine-grained, subject-specific definitions of sub-attributes.
  • Figure 3: FiVA-Adapter architecture and training pipeline. FiVA-Adapter has two key designs: 1) Attribute-specific Visual Prompt Extractor, 2) Multi-image Dual Cross-Attention Module.
  • Figure 4: Qualitative comparisons on single attribute transferring.
  • Figure 5: The combination of multiple visual attributes enables the integration of specific characteristics from different reference images into the target subject.
  • ...and 8 more figures