MixDiff: Mixing Natural and Synthetic Images for Robust Self-Supervised Representations

Reza Akbarian Bafghi; Nidhin Harilal; Claire Monteleoni; Maziar Raissi

TL;DR

MixDiff targets data efficiency and robustness to distribution shift in SSL by replacing one augmented view with a diffusion-generated synthetic image, so that SimCLR, BarlowTwins, and DINO learn cross real–synthetic representations. An image-to-image diffusion variant (IVD) generates $\tilde{x_i}$ from a real input $x_i$, and the resulting pair is fed into existing joint-embedding SSL losses, with no labeled data required. Empirically, MixDiff improves robustness to domain shifts and transfer performance, reduces dependence on heavy augmentations, and reaches competitive or superior performance with less real data. The method is robust to synthetic image quality, generalizes across diffusion models (SD/VD), and offers practical data-efficiency benefits for SSL pre-training, with potentially lower annotation costs and faster adaptation to new domains.

Abstract

This paper introduces MixDiff, a new self-supervised learning (SSL) pre-training framework that combines real and synthetic images. Unlike traditional SSL methods that predominantly use real images, MixDiff uses a variant of Stable Diffusion to replace an augmented instance of a real image, facilitating the learning of cross real-synthetic image representations. Our key insight is that while models trained solely on synthetic images underperform, combining real and synthetic data leads to more robust and adaptable representations. Experiments show MixDiff enhances SimCLR, BarlowTwins, and DINO across various robustness datasets and domain transfer tasks, boosting SimCLR's ImageNet-1K accuracy by 4.56%. Our framework also demonstrates comparable performance without needing any augmentations, a surprising finding in SSL where augmentations are typically crucial.
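
The core idea, treating a real image and a synthetic image generated from it as a positive pair, can be illustrated with a minimal sketch of a SimCLR-style NT-Xent loss. This is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.5, and the toy 2-D embeddings are hypothetical, and the encoder and diffusion model are replaced by hand-written vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent_real_synthetic(real_emb, syn_emb, temperature=0.5):
    """NT-Xent loss where real_emb[i] and syn_emb[i] form a positive pair.

    real_emb: embeddings of (augmented) real images, one per sample.
    syn_emb:  embeddings of synthetic images generated from the same samples.
    The temperature value is an assumption made for this sketch.
    """
    n = len(real_emb)
    z = [l2_normalize(v) for v in real_emb] + [l2_normalize(v) for v in syn_emb]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the paired real/synthetic view
        # Similarities to every other embedding in the batch (self excluded).
        logits = [dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = dot(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over the denominator terms.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy stand-ins for encoder outputs (hypothetical values).
real = [[1.0, 0.0], [0.0, 1.0]]
syn_matched = [[1.0, 0.1], [0.1, 1.0]]   # synthetic views close to their real source
syn_swapped = [[0.1, 1.0], [1.0, 0.1]]   # synthetic views paired with the wrong source

loss_matched = nt_xent_real_synthetic(real, syn_matched)
loss_swapped = nt_xent_real_synthetic(real, syn_swapped)
```

In this toy batch the loss is lower when each synthetic embedding lies close to the embedding of its real source, which is the real-to-synthetic agreement the objective rewards.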

Paper Structure (43 sections, 6 equations, 11 figures, 7 tables)

Introduction
Related Work
Self-supervised Learning.
Learning using Synthetic Data.
Generative Models.
Method
Description of MixDiff
Mixing in joint-embedding SSL
SimCLR + MixDiff:
Barlow Twins + MixDiff:
Mixing in Distillation SSL
DINO + MixDiff:
Experiments
Training Algorithms and Data.
MixDiff boosts robustness to distribution shifts
...and 28 more sections

Figures (11)

Figure 1: Comparison of SimCLR performance on real, synthetic (Syn), and mixed real and synthetic images (MixDiff). The radar charts show normalized accuracy across 8 transfer learning datasets (left) and ImageNet-1K plus 6 distribution shift datasets (right), with values from 0.5 to 1.1. MixDiff enhances in-distribution and robustness performance and generalizes better. More details in Sec. \ref{sec:exp}.
Figure 2: Existing SSL methods, including (A) SimCLR, (B) Barlow Twins, and (C) DINO, enhanced with our MixDiff approach. In both (A) SimCLR and (B) Barlow Twins, we replace the branch representing one view of the positive pair with a synthetic image generated by Stable Diffusion without using the label. This modification enables the learning of real-synthetic view prediction. (C) DINO uses a distillation framework with two global views for the teacher and a mix of two global and eight local views for the student. Our adaptation integrates a blend of global and local, synthetic and real images, facilitating the learning of global-to-local correspondences on top of real-to-synthetic ones.
Figure 3: Top-1 classification accuracies (%) for various models on ImageNet-100 (x-axis) and the average of four domain shift datasets (y-axis). This figure compares the performance of models trained on real, synthetic (Syn), and an equal combination of real and synthetic images (MixDiff). Models in the top-right quadrant exhibit better in-distribution and out-of-distribution accuracies.
Figure 4: Left: Top-1 accuracy on IN-100 for SimCLR models trained with and without MixDiff at different scales of training images. Right: Average top-1 accuracy on four distribution shift datasets. SimCLR+MixDiff outperforms SimCLR, as indicated by the green area showing the performance gap.
Figure 5: Top-1 classification accuracies for various models on ImageNet-100 (x-axis) and the average of four domain shift datasets (y-axis). The models use different approaches for mixing real and synthetic images, as well as varying generation models.
...and 6 more figures