More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai

TL;DR

This work unifies image generation and depth estimation within a single diffusion model by starting from a fixed pre-trained text-to-image (T2I) model and adding lightweight converters. The proposed MERGE framework uses a play-and-plug design in which converters, shared across groups of T2I blocks, transform T2I features for depth estimation while leaving the original generation capability intact. Empirically, MERGE reaches state-of-the-art-level depth estimation performance on NYUv2, ScanNet, and DIODE with only about 12% additional trainable parameters, and it generalizes to zero-shot normal estimation, suggesting a cost-effective path to unified generation-perception models. The approach shows the value of exploiting the rich visual priors in fixed T2I models through efficient converter design and parameter reuse, which reduces data and compute requirements compared with full-parameter fine-tuning or training from scratch.

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE
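
The switching idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: `make_frozen_block`, `Converter`, and `UnifiedModel` are hypothetical names, the "blocks" are simple arithmetic stand-ins for frozen transformer layers, and the converter is assumed to be a per-feature affine map. The placement of one shared converter before each block of a group follows the pipeline figure caption (Figure 3).

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_frozen_block(scale: float) -> Layer:
    """Hypothetical stand-in for a frozen T2I block (never updated)."""
    def block(x: Vector) -> Vector:
        return [scale * v + 1.0 for v in x]
    return block


class Converter:
    """Hypothetical lightweight converter: a per-feature affine map.

    In a real setup these would be the only trainable parameters.
    """

    def __init__(self, dim: int) -> None:
        self.weight = [1.0] * dim
        self.bias = [0.0] * dim

    def __call__(self, x: Vector) -> Vector:
        return [w * v + b for w, v, b in zip(self.weight, x, self.bias)]


class UnifiedModel:
    """Frozen blocks plus one shared converter per group of blocks."""

    def __init__(self, blocks: Sequence[Layer], group_size: int, dim: int) -> None:
        self.blocks = list(blocks)
        self.group_size = group_size
        # Ceiling division: one converter per (possibly partial) group.
        num_groups = (len(self.blocks) + group_size - 1) // group_size
        self.converters = [Converter(dim) for _ in range(num_groups)]

    def forward(self, x: Vector, mode: str = "generation") -> Vector:
        if mode not in ("generation", "depth"):
            raise ValueError(f"unknown mode: {mode!r}")
        for i, block in enumerate(self.blocks):
            if mode == "depth":
                # The group's shared converter runs before each of its blocks.
                x = self.converters[i // self.group_size](x)
            # In generation mode the converters are skipped entirely,
            # so the frozen model behaves exactly as before.
            x = block(x)
        return x


# Usage: four frozen blocks in two groups share two converters.
model = UnifiedModel([make_frozen_block(1.0) for _ in range(4)], group_size=2, dim=2)
generated = model.forward([0.0, 0.0], mode="generation")
model.converters[0].weight = [2.0, 2.0]  # pretend training changed group 0
depth = model.forward([0.0, 0.0], mode="depth")
```

Because the frozen blocks are never modified, changing the converters affects only the depth path; the generation path returns the same output before and after.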

Paper Structure

This paper contains 12 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core lies in leveraging streamlined converters and the rich visual priors stored in generative image models. Built from a fixed image generation model and pluggable converters fine-tuned on synthetic data, our model gains powerful zero-shot depth estimation capability.
  • Figure 2: Comparison between existing methods and ours. Unlike previous works, our method requires only a few additional parameters to unleash the powerful depth estimation capability of the pre-trained model without compromising its inherent T2I generation ability.
  • Figure 3: The pipeline of MERGE. We start from a fixed DiT-based text-to-image (T2I) model whose transformer layers (hereafter referred to as T2I blocks) are divided into groups. A shared, learnable converter is inserted before each T2I block within a group, transforming the model into a depth estimation model. Skipping these converters reverts it to the original T2I model.
  • Figure 4: The cosine similarity between the output features of different T2I blocks within the PixArt [chen2024pixart] model.
  • Figure 5: The process of converter simplification, exemplified by PixArt [chen2024pixart].
  • ...and 1 more figures
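
The similarity measure named in the Figure 4 caption can be sketched in a few lines. This is only an illustration: the page does not say how block outputs are pooled or flattened, so the sketch assumes each block's output has already been reduced to a single feature vector, and `cosine_similarity` and `similarity_matrix` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity between per-block feature vectors."""
    return [[cosine_similarity(fi, fj) for fj in features] for fi in features]


# Usage with made-up vectors standing in for three block outputs
# (assumption: one flattened vector per block).
block_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
matrix = similarity_matrix(block_features)
```

Blocks whose outputs point in nearly the same direction score close to 1 regardless of magnitude, which is the kind of redundancy that sharing one converter across a group of blocks would rely on.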