
Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models

Minho Park, Sunghyun Park, Jooyeol Yun, Jaegul Choo

TL;DR

The paper tackles the domain gap that arises when fine-tuning vision-language models on synthetic, generated data for name-only transfer. It introduces two regularizations: a post-training weight-space ensemble (WSE) that blends zero-shot and fine-tuned classifiers via $\theta_{WSE}=(1-\alpha)\theta_{ZS}+\alpha\theta_{FT}$, and a training-time variance-covariance regularization (VCR) that promotes feature diversity. By linking diversity metrics ${\mathcal{D}}_{Mag}$ and ${\mathcal{D}}_{Dir}$ to real-domain performance, the authors justify the use of VCR to prevent domain-specific overfitting and to encourage richer representations. Across 11 diverse datasets and multiple generation models, the combined approach yields state-of-the-art results for name-only transfer and extends effectively to few-shot classification without introducing extra parameters beyond the enhanced encoder. The work offers practical solutions for leveraging generated data in perception tasks where real data is scarce, with broad applicability to other vision tasks beyond classification.
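The weight-space ensemble formula above can be illustrated with a minimal sketch. It assumes each model's parameters are available as a name-to-values mapping; the function name `weight_space_ensemble`, the parameter names, and the toy values are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def weight_space_ensemble(theta_zs: Weights, theta_ft: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * theta_zs + alpha * theta_ft, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    if theta_zs.keys() != theta_ft.keys():
        raise ValueError("both models must share the same parameter names")

    blended: Weights = {}
    for name, zs_values in theta_zs.items():
        ft_values = theta_ft[name]
        if len(zs_values) != len(ft_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Linear interpolation between the zero-shot and fine-tuned values.
        blended[name] = [
            (1.0 - alpha) * z + alpha * f for z, f in zip(zs_values, ft_values)
        ]
    return blended


# Toy values, for illustration only.
theta_zs = {"encoder.weight": [1.0, 2.0], "head.bias": [0.0]}
theta_ft = {"encoder.weight": [3.0, 6.0], "head.bias": [4.0]}
theta_wse = weight_space_ensemble(theta_zs, theta_ft, alpha=0.5)
print(theta_wse)
```

Setting `alpha=0` recovers the zero-shot weights and `alpha=1` recovers the fine-tuned weights, so $\alpha$ controls how much of the generated-domain fine-tuning is retained.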

Abstract

Recent advances in text-to-image generation have inspired researchers to build datasets tailored for perception models using generative models, which is particularly valuable when real-world data is limited. In this study, we address the challenges that arise when fine-tuning vision-language models (e.g., CLIP) on generated datasets. Specifically, we aim to fine-tune a vision-language model into a classifier for a specific task without access to any real images, a setting known as name-only transfer. However, despite the high fidelity of generated images, we observed significant performance degradation when fine-tuning on generated datasets, caused by the domain gap between real and generated images. To overcome the domain gap, we provide two regularization methods, one applied during training and one applied post-training. First, we leverage the domain-agnostic knowledge of the original pre-trained vision-language model by forming, after training, a weight-space ensemble of the model fine-tuned on the generated dataset and the original pre-trained model. Second, we show that fine-tuned models with high feature diversity achieve high performance in the real domain, which indicates that increasing feature diversity prevents the model from learning knowledge specific to the generated domain. We therefore encourage feature diversity with an additional regularization at training time. Extensive experiments on various classification datasets and text-to-image generation models demonstrate that our analysis and regularization techniques effectively mitigate the domain gap, which has long been overlooked, and enable state-of-the-art performance when training with generated images. Code is available at https://github.com/pmh9960/regft-for-gen

Paper Structure (39 sections, 4 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: (a) Intra-domain and inter-domain Fréchet Inception Distance (FID) reveals a significant domain gap between real and generated images. (b) Accuracy on real and generated ImageNet for the original pre-trained vision-language model (e.g., CLIP) and the fine-tuned models. Fine-tuning on one domain often degrades performance on the other, as seen in both the real and generated domains. In this study, we aim to improve real-domain accuracy by overcoming the domain gap with regularization techniques.
  • Figure 2: Overview of the architecture for name-only transfer of vision-language models (e.g., CLIP). While preceding approaches focused on enriching prompts (green) and adapters (yellow), we aim to fine-tune the CLIP image encoder (red) with generated datasets.
  • Figure 2: Ablation studies of the fine-tuned classifier, weight-space ensemble, and variance-covariance regularization on 11 datasets.
  • Figure 3: Overview of the proposed method. Initially, generated datasets are synthesized from textual conditions via text-to-image generation models. Subsequently, the entire classifier is fine-tuned on the generated dataset, employing cross-entropy loss (${\mathcal{L}}_\text{CE}$) with variance-covariance regularization (${\mathcal{L}}_\text{VCR}$). Lastly, a weight-space ensemble is performed to integrate the zero-shot classifier and the fine-tuned classifier.
  • Figure 4: Magnitude diversity, direction diversity, and real ImageNet accuracy of classifiers fine-tuned on the generated dataset. The results indicate a strong correlation between feature diversity and robustness in the real domain. Based on this observation, we successfully improved real-domain performance via both regularization methods.
  • ...and 9 more figures
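
The Figure 3 caption pairs the cross-entropy loss with a variance-covariance regularization term, but this page does not give the exact form of ${\mathcal{L}}_\text{VCR}$. The sketch below therefore assumes a VICReg-style formulation (a hinge on each feature dimension's standard deviation plus a penalty on off-diagonal covariance). The function name, the target `gamma`, the stabilizer `eps`, and the toy batches are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Tuple


def variance_covariance_terms(
    features: List[List[float]], gamma: float = 1.0, eps: float = 1e-4
) -> Tuple[float, float]:
    """Return (variance_term, covariance_term) for a batch of feature vectors.

    Assumed VICReg-style definitions, not confirmed by the source:
      variance_term   = mean over dimensions of max(0, gamma - std_j)
      covariance_term = sum of squared off-diagonal covariances / dimension
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two feature vectors")
    d = len(features[0])

    # Center each feature dimension over the batch.
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]

    # Unbiased sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Penalize dimensions whose standard deviation falls below gamma.
    variance_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d
    # Penalize correlation between different dimensions.
    covariance_term = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d
    return variance_term, covariance_term


# Toy batches, for illustration only.
collapsed = [[1.0, 1.0]] * 4                                   # no diversity at all
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
diverse = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]

var_collapsed, cov_collapsed = variance_covariance_terms(collapsed)
var_correlated, cov_correlated = variance_covariance_terms(correlated)
var_diverse, cov_diverse = variance_covariance_terms(diverse)
```

Under these assumptions, a collapsed batch is penalized by the variance term, a batch with redundant dimensions is penalized by the covariance term, and a spread-out, decorrelated batch incurs neither penalty. In training, such terms would be added to ${\mathcal{L}}_\text{CE}$ with weights that this page does not specify.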