Collaborative Control for Geometry-Conditioned PBR Image Generation

Shimon Vainer, Mark Boss, Mathias Parger, Konstantin Kutsy, Dante De Nigris, Ciara Rowles, Nicolas Perony, Simon Donné

TL;DR

This work tackles the challenge of generating physically-based rendering (PBR) textures conditioned on geometry without relying on photometrically inconsistent RGB outputs. It introduces Collaborative Control, a bidirectional cross-network framework that keeps a pretrained RGB diffusion model frozen while training a parallel PBR model, so that the pair captures the joint distribution $p(\bm{z}_{rgb}, \bm{z}_{pbr})$ and PBR textures can be generated directly. A dedicated PBR VAE with $14$ latent channels provides the latent compression, the training data is built from Objaverse, and cross-network communication keeps the two modalities coherent. The approach demonstrates strong in-distribution and out-of-distribution (OOD) performance, compatibility with IP-Adapter, and practical potential for Text-to-Texture pipelines, albeit with limitations related to data biases in PBR maps and computational cost.

Abstract

Graphics pipelines require physically-based rendering (PBR) materials, yet current 3D content generation approaches are built on RGB models. We propose to model the PBR image distribution directly, avoiding photometric inaccuracies in RGB generation and the inherent ambiguity in extracting PBR from RGB. As existing paradigms for cross-modal fine-tuning are not suited for PBR generation due to both a lack of data and the high dimensionality of the output modalities, we propose to train a new PBR model that is tightly linked to a frozen RGB model using a novel cross-network communication paradigm. As the base RGB model is fully frozen, the proposed method retains its general performance and remains compatible with e.g. IPAdapters for that base model.

Paper Structure (33 sections, 1 equation, 10 figures, 1 table)

Figures (10)

  • Figure 1: Generated PBR materials. By tightly linking the PBR diffusion model with a frozen RGB model, we produce high-quality PBR images conditioned on geometry and prompts. Visit the project page at https://unity-research.github.io/holo-gen.
  • Figure 2: Collaborative Control. Two parallel models collaborate to generate pixel-aligned outputs of different modalities. We freeze the left pre-trained RGB model and train the right PBR model with its cross-network communication layers. The cross-communication concatenates the states of both models, processes them with a small MLP, and residually distributes the result back to the respective models. As discussed in the results section, prompt cross-attention in the PBR model is counter-productive.
  • Figure 3: Bump map. Similar surface bumps in world space (left) are dissimilar in the UV tangent space (middle) because of the arbitrary UV mapping. Representing the bump map in a tangent space solely dependent on the geometry (right) resolves this issue.
  • Figure 4: Rendering function. The dataset is constructed so that the lighting remains constant with respect to the camera, simplifying the rendering function $f_{RGB}$: notice the similar highlight location.
  • Figure 5: High-level overview of communication in (a) ControlNet [zhang2023adding], (b) ControlNet-XS [zavadski2023controlnetxs], (c) AnimateAnyone [hu2023animateanyone] and (d) our proposed Collaborative Control approach. Blue represents frozen blocks, while orange elements are optimized during training.
  • ...and 5 more figures
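
The cross-network communication described in the Figure 2 caption (concatenate both states, run a small MLP, add the result back residually) can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the authors' implementation. It assumes the exchange acts on one pixel's feature vectors at a time, and the names `cross_communicate`, `init_layer`, the ReLU activation, the layer sizes, and the zero-initialised output layer are all hypothetical choices made for the example.

```python
import random


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng, zero=False):
    """Small random weights, or all zeros when zero=True (assumed, ControlNet-style)."""
    weight = [[0.0 if zero else rng.gauss(0.0, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def cross_communicate(h_rgb, h_pbr, params):
    """One exchange between the two models for a single pixel."""
    (w1, b1), (w2, b2) = params
    joint = h_rgb + h_pbr                                  # concatenate both states
    update = linear(relu(linear(joint, w1, b1)), w2, b2)   # small two-layer MLP
    d_rgb, d_pbr = update[:len(h_rgb)], update[len(h_rgb):]
    new_rgb = [h + d for h, d in zip(h_rgb, d_rgb)]        # residual, RGB branch
    new_pbr = [h + d for h, d in zip(h_pbr, d_pbr)]        # residual, PBR branch
    return new_rgb, new_pbr


rng = random.Random(0)
C_RGB, C_PBR, HIDDEN = 4, 3, 8  # illustrative channel counts only
params = (
    init_layer(C_RGB + C_PBR, HIDDEN, rng),
    init_layer(HIDDEN, C_RGB + C_PBR, rng, zero=True),
)
h_rgb = [rng.gauss(0.0, 1.0) for _ in range(C_RGB)]
h_pbr = [rng.gauss(0.0, 1.0) for _ in range(C_PBR)]
out_rgb, out_pbr = cross_communicate(h_rgb, h_pbr, params)
```

With the output layer set to zero, the update is zero and both states pass through unchanged, so a frozen RGB model would initially behave exactly as it did before the link was added. Whether the paper initialises its layers this way is not stated in the text above; it is included here only to show why a residual formulation makes such a neutral starting point possible.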