Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Haoxin Li, Boyang Li

TL;DR

SPARCL tackles the gap in vision-language compositional understanding by generating multimodal synthetic data with precise variations and robustly training vision-language models (VLMs) on real and synthetic pairs. It combines image feature injection into a fast text-to-image (T2I) generator with AdaIN-style style transfer to improve image fidelity, and introduces an adaptive margin loss that differentiates positive, hard negative, and easy negative samples to focus learning on informative cases. Empirical results across four compositional benchmarks show that SPARCL substantially improves CLIP-based models, outperforming several state-of-the-art synthetic-data methods and achieving notable gains on VL-CheckList, SugarCrepe, and SugarCrepe++ with some trade-offs on other tasks. The approach offers a scalable, data-efficient path to stronger compositional reasoning in multimodal models, with practical implications for robust cross-modal understanding in real-world tasks.

Abstract

Paired image-text data with subtle variations in-between (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing such training data from generative models is a highly coveted prize due to the reduced cost of data collection. However, synthesizing training images for compositional learning presents three challenges: (1) efficiency in generating large quantities of images, (2) text alignment between the generated image and the caption in the exact place of the subtle change, and (3) image fidelity in ensuring sufficient similarity with the original real images in all other places. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model, followed by an image style transfer step, to meet the three challenges. Further, to cope with any residual issues of text alignment, we propose an adaptive margin loss to filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluation on four compositional understanding benchmarks demonstrates that SPARCL significantly improves the compositionality of CLIP, boosting the average accuracy of the CLIP base model by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
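The style transfer step mentioned above is characterized in the TL;DR as AdaIN-style. As a rough illustration only, the following sketch shows the standard adaptive instance normalization (AdaIN) operation: each content feature channel is re-normalized so that its mean and standard deviation match those of the corresponding style channel. The function names, the flat-list feature layout, and the toy values are assumptions made for this example; the source does not specify how SPARCL applies the operation or to which features.

```python
import math
from typing import List, Tuple

Channel = List[float]  # one feature channel, flattened over spatial positions


def channel_stats(values: Channel, eps: float = 1e-5) -> Tuple[float, float]:
    """Return (mean, std) of one channel; eps keeps the std strictly positive."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content: List[Channel], style: List[Channel], eps: float = 1e-5) -> List[Channel]:
    """Shift and scale each content channel to the style channel's mean and std."""
    if len(content) != len(style):
        raise ValueError("content and style must have the same number of channels")
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        # Normalize the content channel, then apply the style statistics.
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out


if __name__ == "__main__":
    # Hypothetical toy features: 2 channels, 4 spatial positions each.
    content = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
    style = [[5.0, 7.0, 9.0, 11.0], [-1.0, 0.0, 1.0, 2.0]]
    for chan, ref in zip(adain(content, style), style):
        print(channel_stats(chan), channel_stats(ref))
```

In this sketch the spatial layout of the content features is preserved (values keep their relative order within a channel), while the per-channel statistics are taken from the style features. In the setting the abstract describes, the real image would presumably supply the style statistics, but that mapping is an assumption here.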

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Challenges in generating and training on synthetic data: (a) When generating an image with subtle variations based on a real image and a target caption specifying the variations, an image editing model (InstructPix2Pix; Brooks et al., 2023) struggles with text alignment (middle), while an image generation model (Rombach et al., 2022) fails to maintain image fidelity (right). (b) Synthetic positive and negative image-caption pairs show different levels of alignment quality. The subtle variations in the synthetic negative caption (left) make it difficult to distinguish from the positive; the over-modified negative caption (middle) is easy to distinguish; and the hallucinated content in the synthetic positive caption (right) results in an incorrect positive.
  • Figure 2: An overview of SPARCL. (a) Starting with a real image-caption pair, we generate synthetic positive and negative pairs with subtle variations using an LLM and a fast T2I model. To improve the quality of subtle variations in synthetic images, we introduce image feature injection to reduce unintended variations from a standard T2I model (see the section on image feature injection). (b) We train the VLM using both real and synthetic samples. In addition to a sigmoid loss for distinguishing positive and negative pairs, we apply an adaptive margin loss that leverages varying alignment levels across training samples to learn informative nuanced distinctions (see the section on the training loss).
  • Figure A1: Prompts used to generate negative and positive captions.
  • Figure A2: Examples of synthetic samples from StyleAligned. The algorithm did not alter the image content according to the caption.
  • Figure A3: Examples of synthetic samples without and with image feature injection. In these examples, the image feature injection technique achieves alignment of the subject size and the viewing angle with those in real images.
  • ...and 1 more figure
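
The Figure 2 caption refers to a sigmoid loss for separating positive and negative image-caption pairs, combined with a margin-based term. The sketch below is a generic illustration under stated assumptions, not the paper's formulation: it uses the common pairwise form -log sigmoid(y * s) with y = +1 for positives and y = -1 for negatives, and a plain hinge with a fixed margin. The rule SPARCL uses to adapt the margin per sample is not given in this section and is not reproduced here; all function names are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of -log sigmoid(y * s) over pairs; y is +1 (positive) or -1 (negative)."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if any(y not in (1, -1) for y in labels):
        raise ValueError("labels must be +1 or -1")
    return sum(-log_sigmoid(y * s) for s, y in zip(logits, labels)) / len(logits)


def fixed_margin_hinge(pos_score: float, neg_score: float, margin: float) -> float:
    """Penalize a negative pair that scores within `margin` of its positive.

    The margin is a constant in this sketch (an assumption); an adaptive
    variant would choose it per sample.
    """
    return max(0.0, margin - (pos_score - neg_score))


if __name__ == "__main__":
    # Hypothetical similarity logits for one positive and two negative pairs.
    print(pairwise_sigmoid_loss([2.0, -1.5, 0.3], [1, -1, -1]))
    print(fixed_margin_hinge(pos_score=1.0, neg_score=0.9, margin=0.5))
```

With a hinge of this kind, a negative that already scores far below its positive contributes zero loss, so the gradient comes only from pairs that are still close together.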