More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai
TL;DR
This work unifies image generation and depth estimation within a single diffusion model by starting from a fixed pre-trained text-to-image (T2I) model and adding lightweight converters. The proposed MERGE framework uses a play-and-plug design in which pluggable converters, shared within groups through a Group Reuse Mechanism, transform T2I features for depth estimation while preserving the original generation capability. Empirically, MERGE achieves state-of-the-art depth performance among unified generation and depth estimation models on NYUv2, ScanNet, and DIODE with only about 12% additional trainable parameters, and it generalizes to zero-shot normal estimation. The approach shows that the rich visual priors in a fixed T2I model can be exploited through efficient converter design and parameter reuse, which reduces data and compute requirements compared to full-parameter fine-tuning or training from scratch.
Abstract
Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation that starts from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than generate images: it can also be extended to depth estimation with little effort. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple, pluggable converters. We also propose a Group Reuse Mechanism that encourages parameter reuse and improves the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE
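The mode switching and parameter reuse described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `ToyUnifiedModel`, `Converter`, and `make_frozen_block` are hypothetical, the "blocks" are plain affine functions rather than diffusion layers, and both the placement of a converter before each block and the grouping of consecutive blocks are assumptions made for illustration only. The sketch shows two properties: the generation path never touches the converters, and one converter is reused by every block in its group.

```python
from typing import Callable, List

Vector = List[float]


def make_frozen_block(scale: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a frozen pre-trained block (never updated)."""
    def block(x: Vector) -> Vector:
        return [scale * v + 1.0 for v in x]
    return block


class Converter:
    """Hypothetical lightweight learnable module: one weight and one bias."""

    def __init__(self, weight: float = 1.0, bias: float = 0.0) -> None:
        self.weight = weight
        self.bias = bias

    def __call__(self, x: Vector) -> Vector:
        return [self.weight * v + self.bias for v in x]


class ToyUnifiedModel:
    """Frozen blocks plus pluggable converters shared within groups of blocks."""

    def __init__(self, num_blocks: int = 6, group_size: int = 3) -> None:
        self.blocks = [make_frozen_block(0.5 + 0.1 * i) for i in range(num_blocks)]
        self.group_size = group_size
        # Assumption: consecutive blocks form a group and share one converter.
        num_groups = (num_blocks + group_size - 1) // group_size
        self.converters = [Converter() for _ in range(num_groups)]

    def forward(self, x: Vector, mode: str = "generation") -> Vector:
        if mode not in ("generation", "depth"):
            raise ValueError(f"unknown mode: {mode!r}")
        for i, block in enumerate(self.blocks):
            if mode == "depth":
                # Converters are plugged in only in depth mode.
                x = self.converters[i // self.group_size](x)
            x = block(x)
        return x


model = ToyUnifiedModel(num_blocks=6, group_size=3)
features = [1.0, 2.0]
generation_before = model.forward(features, mode="generation")

# Simulate "training" a converter; the frozen blocks are left untouched.
model.converters[0].weight = 2.0

generation_after = model.forward(features, mode="generation")
depth_output = model.forward(features, mode="depth")
```

In this sketch, changing a converter alters only the depth path, so `generation_after` equals `generation_before`. With six blocks and a group size of three, only two converters are created, which is the sense in which group reuse keeps the number of added parameters small.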
