Hybrid diffusion models: combining supervised and generative pretraining for label-efficient fine-tuning of segmentation models

Bruno Sauvalle, Mathieu Salzmann

TL;DR

The paper tackles label-efficient domain adaptation for segmentation by fusing supervised pretraining on a labeled source domain with generative pretraining via a hybrid diffusion model that captures the joint distribution $p(x,y)$ of images and masks. A single UNet is trained on the source domain to act as both a diffusion denoiser and a segmentation predictor: its output is expanded to $3+K$ channels, and the loss downweights the mask term to balance the two objectives. Two theoretical propositions establish that reverse SDE/ODE dynamics driven by $\mathbb{E}[z \mid x_t]$ can generate samples from $p(x,y)$, which gives the learned segmentation representations a generative interpretation. Empirically, vanilla fine-tuning of the hybrid diffusion model yields superior or competitive results compared with purely supervised or purely unsupervised pretraining, which supports combining the two pretraining paradigms for label-efficient segmentation across related domains.
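The combined objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the channel layout (three image channels followed by $K$ mask logits), the choice of mean squared error and cross-entropy, the assumption that the network predicts the clean image, and the `mask_weight` value are all assumptions made for illustration.

```python
import numpy as np

def hybrid_loss(pred, clean_image, mask_labels, mask_weight=0.1):
    """Denoising MSE plus a downweighted cross-entropy on the mask channels.

    pred:        (3 + K, H, W) network output; the first 3 channels are the
                 image estimate, the remaining K are class logits (assumed layout).
    clean_image: (3, H, W) clean RGB target.
    mask_labels: (H, W) integer class indices in [0, K).
    mask_weight: hypothetical weight below 1 applied to the mask term.
    """
    image_pred, mask_logits = pred[:3], pred[3:]
    denoise_term = np.mean((image_pred - clean_image) ** 2)

    # Numerically stable log-softmax over the class axis.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    true_log_probs = np.take_along_axis(log_probs, mask_labels[None], axis=0)[0]
    mask_term = -np.mean(true_log_probs)

    total = denoise_term + mask_weight * mask_term
    return total, denoise_term, mask_term


# Usage with random stand-in data (K = 4 classes, 8x8 image).
rng = np.random.default_rng(0)
K, H, W = 4, 8, 8
pred = rng.normal(size=(3 + K, H, W))
clean = rng.normal(size=(3, H, W))
labels = rng.integers(0, K, size=(H, W))
total, denoise, mask = hybrid_loss(pred, clean, labels)
```

Returning the two terms separately makes it easy to check that neither objective dominates when tuning the weight.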

Abstract

In this paper, we consider the task of label-efficient fine-tuning of segmentation models: we assume that a large labeled dataset is available in one domain and allows us to train an accurate segmentation model there, and that this model must be adapted to a related domain where only a few samples are available. We observe that this adaptation can be done in two distinct ways. The first, supervised pretraining, simply takes the model trained on the first domain with classical supervised learning and fine-tunes it on the second domain using the available labeled samples. The second performs self-supervised pretraining on the first domain with a generic pretext task in order to obtain high-quality representations, which can then be used to train a model on the second domain in a label-efficient way. We propose to fuse these two approaches by introducing a new pretext task: performing image denoising and mask prediction simultaneously on the first domain. We motivate this choice by showing that, just as an image denoiser conditioned on the noise level can be regarded as a generative model for the unlabeled image distribution through the theory of diffusion models, a model trained on this new pretext task can be regarded as a generative model for the joint distribution of images and segmentation masks, under the assumption that the mapping from images to segmentation masks is deterministic. We then show empirically on several datasets that fine-tuning a model pretrained with this approach leads to better results than fine-tuning a similar model trained with either supervised or unsupervised pretraining alone.
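The determinism assumption can be written out explicitly. As a sketch, writing $f$ for the image-to-mask mapping (notation introduced here for illustration, not taken from the paper), the joint distribution concentrates on the graph of $f$:

$$p(x, y) = p(x)\,\delta\big(y - f(x)\big)$$

Under this assumption, drawing a pair $(x, y)$ from the joint distribution amounts to drawing an image $x \sim p(x)$ and setting $y = f(x)$.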

Paper Structure (26 sections, 26 equations, 2 tables)