ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

Soon Yau Cheong; Armin Mustafa; Andrew Gilbert

TL;DR

ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning, demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks.

Abstract

This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires trainable parameters and dataset size multiple orders of magnitude smaller than the current state-of-the-art IP-Adapter. Despite this, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from the project page: https://soon-yau.github.io/visconet/

Paper Structure (27 sections, 3 equations, 25 figures, 3 tables)

Introduction
Related Works
Method
Preliminaries
Replace Text with Visual Prompt
Control Feature Masking
Harmonizing Text and Visual Influence
Training Setup
Image Resolution
Experiments
Mode Collapse and Control Strength
Re-identification
Generating Diverse Human Image Styles
Ablations
Limitations
...and 12 more sections

Figures (25)

Figure 1: Our proposed ViscoNet demonstrates broad versatility in multimodal human image tasks, including visual prompts, pose re-targeting, virtual try-on, re-identification using either text or visual prompts, text prompts, texture transfer, stylization, and latent space interpolation to perform human morphing.
Figure 2: To motivate our work, this figure illustrates how increasing text complexity in ControlNet can expose (c) a domain gap and eventually lead to (d) mode collapse. IP-Adapter also exhibits (e) catastrophic forgetting, resulting in the inability to generate a rich background. Both show concept bleeding, assigning the wrong color to clothing garments.
Figure 3: Our method retains generative power of the T2I backbone in (a)-(d) various image styles and rich backgrounds while maintaining the person and clothing appearance, assigning correct clothing colors. In (e), we can control the level of stylization to expand it to the clothing styles.
Figure 4: Architectural diagram showing our contribution concerning the backbone LDM and ControlNet layers. We omit time embedding, zero convolution, and some blocks from the ControlNet diagram for simplicity.
Figure 5: Effect of control strength (%). Compared to ControlNet and IP-Adapter, our method can escape mode collapse faster, generating a harmonious image style while maintaining good visual control.
...and 20 more figures
