DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Dewei Zhou, Mingwei Li, Zongxin Yang, Yi Yang

TL;DR

DreamRenderer tackles multi-instance attribute control in large-scale text-to-image models conditioned on depth or Canny edge inputs. It introduces Bridge Image Tokens for Hard Text Attribute Binding and restricts Hard Image Attribute Binding to the middle Joint Attention layers of FLUX, using soft binding elsewhere, which preserves image quality while giving precise per-instance control. On the COCO-POS and COCO-MIG benchmarks, it improves the Image Success Ratio by 17.7% over FLUX and, when used to re-render the outputs of layout-to-image models such as GLIGEN and 3DIS, improves their performance by up to 26.8%. The approach is training-free and layer-aware, offering a practical way to improve multi-instance generation in image-conditioned diffusion models.

Abstract

Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: https://limuloo.github.io/DreamRenderer/.
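
The two image-binding modes described in the abstract can be read as different attention masks over the joint text and image token sequence. The following is a minimal sketch of that reading in Python with NumPy, not the authors' implementation: the token ordering, the function names `joint_attention_mask` and `binding_mode_for_layer`, and the `vital_layers` range are all assumptions made for illustration, and Bridge Image Tokens are left out for brevity.

```python
import numpy as np


def joint_attention_mask(instance_masks, text_len, global_len, hard):
    """Boolean mask (rows = queries, columns = keys); True means "may attend".

    Assumed token order: [global text | instance 0 text | instance 1 text | ... | image].
    instance_masks: bool array (num_instances, num_image_tokens), True where an
    image token lies inside that instance's box or mask.
    """
    num_inst, num_img = instance_masks.shape
    num_txt = global_len + num_inst * text_len
    n = num_txt + num_img
    allowed = np.zeros((n, n), dtype=bool)

    glob = np.arange(global_len)
    img = np.arange(num_txt, n)
    background = ~instance_masks.any(axis=0)

    # Global text attends to itself; background image tokens attend to the
    # global text and to the whole image.
    allowed[np.ix_(glob, glob)] = True
    allowed[np.ix_(img[background], glob)] = True
    allowed[np.ix_(img[background], img)] = True

    for i in range(num_inst):
        start = global_len + i * text_len
        txt = np.arange(start, start + text_len)
        region = img[instance_masks[i]]
        # Each instance prompt attends to itself, and the instance's image
        # tokens attend to that prompt only (no other instance's text).
        allowed[np.ix_(txt, txt)] = True
        allowed[np.ix_(region, txt)] = True
        # Hard binding: image keys restricted to the same instance region.
        # Soft binding: image keys cover the whole image.
        keys = region if hard else img
        allowed[np.ix_(region, keys)] = True
    return allowed


def binding_mode_for_layer(layer_idx, vital_layers):
    """Hard binding inside the (hypothetical) vital-layer set, soft elsewhere."""
    return "hard" if layer_idx in vital_layers else "soft"


# Toy layout: 4 image tokens, instance 0 covers tokens 0-1, instance 1 covers
# token 2, token 3 is background.
masks = np.array([[True, True, False, False],
                  [False, False, True, False]])
hard_mask = joint_attention_mask(masks, text_len=2, global_len=1, hard=True)
soft_mask = joint_attention_mask(masks, text_len=2, global_len=1, hard=False)

# Placeholder range for illustration only; the paper identifies the actual
# vital layers through its own analysis of FLUX.
vital_layers = range(20, 40)
```

In this sketch, with `hard=True` an image token inside an instance region can only see image tokens from the same region, whereas with `hard=False` it sees the whole image. In both modes it never attends to another instance's text tokens, which is the property intended to limit attribute leakage.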

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Images generated using DreamRenderer. DreamRenderer is a plug-and-play controller that grants users fine-grained control over the content of each region and instance during depth- or canny-conditioned generation without any training. By leveraging FLUX Redux to translate images into text embeddings, it further allows users to seamlessly control generated content based directly on visual inputs.
  • Figure 2: Overview of DreamRenderer. (a) The pipeline of DreamRenderer. (b) Attention maps in Joint Attention, covering 1) Hard Text Attribute Binding, 2) Hard Image Attribute Binding, and 3) Soft Image Attribute Binding. In the attention maps shown in (b), rows represent queries and columns represent keys. Different patterns distinguish image tokens, text tokens, and bridge image tokens, while different colors mark tokens from different instances (one color for an ice cat, one for a fire dog, and one for the global text tokens and background image tokens).
  • Figure 3: Vital Binding Layer Search. We apply Hard Image Attribute Binding layer by layer and observe that applying it in the FLUX model’s input or output layers degrades performance, whereas applying it in the middle layers yields improvements.
  • Figure 4: Qualitative comparison on the COCO-POS benchmark.
  • Figure 5: Qualitative comparison on the COCO-MIG benchmark.
  • ...and 2 more figures