Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li, Boyang Li
TL;DR
SPARCL tackles the gap in vision-language compositional understanding by generating multimodal synthetic data with precise variations and robustly training VLMs on real and synthetic pairs. It combines image feature injection into a fast T2I generator with AdaIN-style style transfer to improve image fidelity, and introduces an adaptive margin loss that differentiates positive, hard negative, and easy negative samples to focus learning on informative cases. Empirical results across four compositional benchmarks show that SPARCL substantially improves CLIP-based models, outperforming several state-of-the-art synthetic-data methods and achieving notable gains on VL-CheckList, SugarCrepe, and SugarCrepe++ with some trade-offs on other tasks. The approach offers a scalable, data-efficient path to stronger compositional reasoning in multimodal models, with practical implications for robust cross-modal understanding in real-world tasks.
Abstract
Paired image-text data with subtle variations between the pairs (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models (VLMs) with proper compositional understanding. Synthesizing such training data with generative models is highly attractive because it reduces the cost of data collection. However, synthesizing training images for compositional learning presents three challenges: (1) efficiency in generating large quantities of images, (2) text alignment between the generated image and the caption at the exact location of the subtle change, and (3) image fidelity, that is, sufficient similarity to the original real image everywhere else. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model, followed by an image style transfer step, to address these three challenges. Further, to cope with residual text-alignment issues, we propose an adaptive margin loss that filters out potentially incorrect synthetic samples and focuses learning on informative hard samples. Evaluation on four compositional understanding benchmarks demonstrates that SPARCL significantly improves the compositionality of CLIP, boosting the average accuracy of the CLIP base model by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
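The summary above describes the style transfer step as AdaIN-style. As a rough illustration of what that operation does, the sketch below implements standard adaptive instance normalization: each content channel is normalized by its own mean and standard deviation, then rescaled to the statistics of the matching style channel. This is a minimal sketch under stated assumptions, not SPARCL's implementation. The text here does not say which features or layers the method operates on, so the list-of-channels representation, the function names `channel_stats` and `adain`, and the `eps` value are all hypothetical choices made for illustration.

```python
from statistics import fmean, pstdev


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one flattened feature channel.

    eps is added to the variance so constant channels do not divide by zero.
    """
    mu = fmean(channel)
    sigma = (pstdev(channel, mu=mu) ** 2 + eps) ** 0.5
    return mu, sigma


def adain(content, style, eps=1e-5):
    """Standard AdaIN on channel-first features (illustrative layout).

    content, style: lists of channels, each channel a flat list of floats.
    Returns content features whose per-channel mean and std are shifted
    to match those of the corresponding style channel.
    """
    if len(content) != len(style):
        raise ValueError("content and style need the same number of channels")
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mu, c_sigma = channel_stats(c_chan, eps)
        s_mu, s_sigma = channel_stats(s_chan, eps)
        # Normalize with content statistics, then rescale with style statistics.
        out.append([s_sigma * (v - c_mu) / c_sigma + s_mu for v in c_chan])
    return out


if __name__ == "__main__":
    # Toy features: "generated" content and "real" style, two channels each.
    generated = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.0, 10.0, 10.0]]
    real = [[5.0, 7.0, 9.0, 11.0], [1.0, 2.0, 3.0, 4.0]]
    for channel in adain(generated, real):
        print([round(v, 3) for v in channel])
```

In this reading, the synthetic image would play the role of the content input and the original real image the style input, so that the generated image keeps its edited content while its feature statistics move toward those of the real image. That pairing is an interpretation of the fidelity goal stated in the abstract, not a detail confirmed by it.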
