Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

Yuanxiang Huangfu, Chaochao Wang, Weilei Wang

TL;DR

Role-SynthCLIP addresses the semantic diversity bottleneck in synthetic CLIP data by using multi-expert role-playing prompts to elicit diverse, perspective-aware captions from multimodal large language models. It combines expert-role generation, role-consistent captioning, a GPT-5-driven role-aware filter, and Long-CLIP-style long-caption adaptation with a multi-positive contrastive loss, producing high-quality image-text pairs without increasing data volume. The approach yields state-of-the-art results in data-efficient CLIP training: a CLIP-B/16 model trained on 1 million Role-SynthCLIP pairs reaches 64.1% Recall@1 on the MS COCO validation set, 2.8 percentage points above the best synthetic baseline trained on 5M pairs, and performs robustly on out-of-distribution tasks. These findings indicate that controlled, cognitively diverse data generation can outperform sheer data quantity, enabling practical, scalable improvements in vision-language pretraining and downstream cross-modal understanding.
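
The multi-positive contrastive loss mentioned above is not spelled out in this summary. As a rough illustration only, the sketch below shows one common image-to-text formulation in which every caption written for an image counts as a positive and all other captions in the batch count as negatives. The function name `multi_positive_loss`, the `caption_owner` argument, and the temperature value are assumptions made for this example, not the paper's definitions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def multi_positive_loss(image_embs, caption_embs, caption_owner, temperature=0.07):
    """Image-to-text InfoNCE averaged over all positives of each image.

    caption_owner[j] is the index of the image that caption j describes.
    Assumes every image has at least one caption (hypothetical setup).
    """
    images = [normalize(v) for v in image_embs]
    captions = [normalize(v) for v in caption_embs]
    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarity to every caption in the batch, scaled by temperature.
        logits = [dot(img, cap) / temperature for cap in captions]
        # Numerically stable log-sum-exp over all captions.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        positives = [j for j, owner in enumerate(caption_owner) if owner == i]
        # Average negative log-likelihood over this image's positive captions.
        total += -sum(logits[j] - log_z for j in positives) / len(positives)
    return total / len(images)
```

With one caption per image this reduces to the usual single-positive image-to-text term; a symmetric text-to-image term could be added in the same way.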

Abstract

The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods primarily focus on increasing data volume, an emphasis that often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
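
For readers unfamiliar with the metric, Recall@1 in retrieval is the fraction of queries whose top-ranked result is a correct match. The sketch below is a minimal, generic Recall@K computation over a query-by-candidate similarity matrix; the function name `recall_at_k` and the toy scores are illustrative assumptions and do not reproduce the paper's evaluation code or its retrieval direction.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[q][j] is the score between query q and candidate j.
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in ground_truth[q] for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example with three queries and three candidates (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
toy_ground_truth = [{0}, {1}, {2}]
toy_recall_at_1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
```

In the toy data only the first query ranks its correct candidate first, so `toy_recall_at_1` is one third.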

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 10 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Comparison of synthetic data generation paradigms. Our proposed Role-SynthCLIP framework addresses the semantic impoverishment of existing methods by leveraging multi-expert role-playing prompts. (Top) Under the conventional paradigm, generic prompts fed into the MLLM yield shallow, single-perspective descriptions. (Bottom) Our approach introduces role-play templates (e.g., Compositional Analyst, Narrative Setter), which guide the MLLM to perceive the image from diverse cognitive perspectives. This mechanism generates semantically rich captions that focus on fine-grained elements (e.g., specific objects, visual structure, context), maximizing the diversity of representations learned during VLM pre-training.
  • Figure 2: Overview of our Role-SynthCLIP framework. We begin by sampling multiple distinct expert roles from a large language model, each with a defined specialty and set of responsibilities. Subsequently, these expert annotators describe an image from their specific professional viewpoint. Finally, we employ a filtering model to remove potential noise from the generated data, ensuring that the descriptions are both accurate to the image content and consistent with the assigned roles.
  • Figure 3: Comparison of text-to-image generation using Long-CLIP and Role-SynthCLIP as text encoders. The prompt describes "a serene lakeside scene with tall trees and an old man sitting on a rock." Role-SynthCLIP better preserves these semantic details, accurately depicting both the human subject and the tall trees.
  • Figure 4: Role-conditioned attention behavior in Role-SynthCLIP. Token Activation Map (TAM) visualizations compare the caption token "buildings" produced under three expert roles (Narrative or Scene Setter, Emotional Responder, and Compositional Analyst) against the original image.
  • Figure 5: Qualitative comparison of attention maps between CLIP and Role-SynthCLIP. Gradient-based saliency (Grad-CAM) is visualized from the text [CLS] token to the image patch tokens for both models.
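
The three stages named in the Figure 2 caption (sampling expert roles, captioning each image from each role's viewpoint, and filtering noisy outputs) can be outlined roughly as follows. This is only a structural sketch: the `Role` and `Caption` types, the prompt wording in `build_prompt`, and the injected `caption_fn` and `keep_fn` callables are all hypothetical stand-ins for the MLLM and the filtering model, whose actual interfaces are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Role:
    name: str              # e.g. "Compositional Analyst"
    responsibilities: str  # hypothetical free-text description of the role


@dataclass(frozen=True)
class Caption:
    image_id: str
    role: str
    text: str


def build_prompt(role: Role) -> str:
    """Assumed prompt template; the paper's real templates may differ."""
    return (
        f"You are a {role.name}. {role.responsibilities} "
        "Describe the image from this perspective."
    )


def synthesize(
    image_ids: Iterable[str],
    roles: List[Role],
    caption_fn: Callable[[str, str], str],
    keep_fn: Callable[[Caption], bool],
) -> List[Caption]:
    """Caption every image under every role, then keep what passes the filter.

    caption_fn(image_id, prompt) stands in for an MLLM call.
    keep_fn(caption) stands in for the role-aware filtering model.
    """
    kept = []
    for image_id in image_ids:
        for role in roles:
            text = caption_fn(image_id, build_prompt(role))
            candidate = Caption(image_id=image_id, role=role.name, text=text)
            if keep_fn(candidate):
                kept.append(candidate)
    return kept
```

Passing the model calls in as plain callables keeps the sketch runnable with stubs; the surviving captions for one image could then serve as its multiple positives during training.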