Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach
Yuanxiang Huangfu, Chaochao Wang, Weilei Wang
TL;DR
Role-SynthCLIP addresses the semantic diversity bottleneck in synthetic CLIP data by using multi-expert role-playing prompts to elicit diverse, perspective-aware captions from multimodal large language models. It combines expert-role generation, role-consistent captioning, a GPT-5–driven role-aware filter, and Long-CLIP style long-caption adaptation with a multi-positive contrastive loss to synthesize high-quality image-text pairs without increasing data volume. The approach yields state-of-the-art results in data-efficient CLIP training: a CLIP-B/16 model trained on 1 million Role-SynthCLIP pairs reaches 64.1% Recall@1 on the MS COCO validation set, 2.8 percentage points above the best existing synthetic data baseline trained on 5M pairs, and it performs robustly on out-of-distribution tasks. These findings indicate that controlled, cognitively diverse data generation can outperform sheer data quantity, enabling practical, scalable improvements in vision-language pretraining and downstream cross-modal understanding.
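To make the multi-positive contrastive loss mentioned above concrete, the following is a minimal sketch in plain Python. It assumes one common formulation, in which each image has several positive captions (for example, one per expert role) and the image-to-text loss averages the negative log-likelihood over those positives. The exact loss used by Role-SynthCLIP is not specified in this summary, so the function name `multi_positive_loss`, the temperature of 0.07, and the toy similarity values are illustrative assumptions.

```python
import math

def multi_positive_loss(sim, positives, temperature=0.07):
    """Image-to-text contrastive loss with several positive captions per image.

    sim[i][j]    : similarity between image i and caption j (assumed precomputed)
    positives[i] : indices of the captions that describe image i
    """
    total = 0.0
    for row, pos in zip(sim, positives):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Average the negative log-probability over this image's positives.
        total += -sum(logits[j] - log_denom for j in pos) / len(pos)
    return total / len(sim)

# Toy example (hypothetical values): 2 images, 4 captions, 2 role captions each.
sim = [
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.2, 0.7, 0.9],
]
aligned = multi_positive_loss(sim, [[0, 1], [2, 3]])
mismatched = multi_positive_loss(sim, [[2, 3], [0, 1]])
print(aligned < mismatched)
```

With a single positive per image this reduces to the standard softmax contrastive loss in the image-to-text direction. A symmetric text-to-image term is usually added in CLIP-style training but is omitted here for brevity.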
Abstract
The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods focus primarily on increasing data volume, an emphasis that often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
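For readers unfamiliar with the retrieval metric reported above, Recall@1 is the fraction of queries whose top-ranked candidate is a ground-truth match. The sketch below computes it from a query-by-candidate similarity matrix in plain Python. The function name `recall_at_1` and the toy values are hypothetical, and the abstract does not state whether the reported figure is image-to-text or text-to-image retrieval, so the code is direction-agnostic.

```python
def recall_at_1(sim, positives):
    """Fraction of queries whose highest-scoring candidate is a correct match.

    sim[i][j]    : similarity between query i and candidate j
    positives[i] : set of candidate indices that are correct for query i
    """
    hits = 0
    for row, pos in zip(sim, positives):
        top = max(range(len(row)), key=row.__getitem__)  # index of best candidate
        hits += top in pos
    return hits / len(sim)

# Toy example (hypothetical values): 3 queries, 3 candidates.
sim = [
    [0.9, 0.1, 0.2],  # top candidate 0, correct
    [0.3, 0.2, 0.8],  # top candidate 2, correct
    [0.1, 0.6, 0.5],  # top candidate 1, but the correct one is 2
]
score = recall_at_1(sim, [{0}, {2}, {2}])
print(f"Recall@1 = {score:.3f}")
```

On MS COCO each image has multiple reference captions, which is why `positives` holds a set of correct indices per query instead of a single index.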
