MixDiff: Mixing Natural and Synthetic Images for Robust Self-Supervised Representations
Reza Akbarian Bafghi, Nidhin Harilal, Claire Monteleoni, Maziar Raissi
TL;DR
MixDiff addresses SSL data efficiency and distribution shifts by replacing an augmented view with a diffusion-generated synthetic image, enabling cross real–synthetic representation learning across SimCLR, BarlowTwins, and DINO. The approach uses an image-to-image diffusion variant (IVD) to generate $\tilde{x}_i$ from real input $x_i$ and integrates this into existing joint-embedding SSL losses, yielding improved robustness and transfer without requiring labeled data. Empirically, MixDiff boosts robustness to domain shifts and transfer performance, reduces dependence on heavy augmentations, and enables competitive or superior performance with less real data. The method is demonstrated to be robust to synthetic image quality, generalizes across diffusion models (SD/VD), and offers practical data-efficiency benefits for SSL pre-training with potentially lower annotation costs and faster adaptation to new domains.
Abstract
This paper introduces MixDiff, a new self-supervised learning (SSL) pre-training framework that combines real and synthetic images. Unlike traditional SSL methods that predominantly use real images, MixDiff uses a variant of Stable Diffusion to replace an augmented instance of a real image, facilitating the learning of cross real-synthetic image representations. Our key insight is that while models trained solely on synthetic images underperform, combining real and synthetic data leads to more robust and adaptable representations. Experiments show MixDiff enhances SimCLR, BarlowTwins, and DINO across various robustness datasets and domain transfer tasks, boosting SimCLR's ImageNet-1K accuracy by 4.56%. Our framework also demonstrates comparable performance without needing any augmentations, a surprising finding in SSL where augmentations are typically crucial.
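The pairing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a SimCLR-style NT-Xent objective, and the names `mixed_views`, `nt_xent`, `augment`, and `generate` are hypothetical. In particular, `generate` stands in for the image-to-image diffusion model that maps a real image $x_i$ to its synthetic counterpart $\tilde{x}_i$, and embeddings are plain Python lists rather than encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def mixed_views(x, augment, generate):
    """Build the two views for one real image x.

    Assumption: one view is an augmented real image, and the other is a
    synthetic image produced from x (here by the hypothetical `generate`).
    """
    return augment(x), generate(x)


def nt_xent(z_real, z_syn, temperature=0.5):
    """SimCLR-style NT-Xent loss over paired embeddings.

    z_real[i] and z_syn[i] are embeddings of the real view and the
    synthetic view of the same image, so they form the positive pair.
    """
    z = [l2_normalize(v) for v in z_real + z_syn]
    n = len(z_real)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the matching view
        # Similarities to every other embedding (self excluded).
        logits = [dot(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = dot(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)
```

In this sketch the loss is low when each real embedding is closest to the embedding of its own synthetic counterpart, which is the cross real-synthetic alignment the framework aims for. The same pairing could in principle be fed to a BarlowTwins or DINO objective instead of NT-Xent.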
