Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

TL;DR

Vision-language models suffer from poor compositional reasoning due to loose image-text alignment. The authors introduce Dense and Aligned Captions (DAC), a data-centric fine-tuning pipeline that improves caption quality and density through BLIP2, LLM expansion, and semantic segmentation, coupled with MIL-based learning, negative text augmentation, and LoRA. Applied to CLIP on CC3M, DAC yields substantial gains on compositional benchmarks (up to 27% in inter-object relations and 6.7% average) while preserving linear transfer, and can exploit unlabeled images. This approach reduces reliance on costly annotations and enhances robust multimodal understanding in practical systems.

Abstract

Vision and Language (VL) models offer an effective method for aligning the representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias': their representations behave as 'bags of nouns', mostly ignoring or downplaying the attributes, relations, and states of objects described in texts or appearing in images. Although some great attempts at fixing these 'compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for fine-tuning and pre-training the VL model: (i) the caption quality, or in other words the 'image-alignment', of the texts; and (ii) the 'density' of the captions, in the sense of mentioning all the details appearing in the image. We propose a fine-tuning approach that automatically treats these factors, leveraging a standard VL dataset (CC3M). Applied to CLIP, our approach yields a significant increase in compositional reasoning performance: up to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest baseline, and $6.7\%$ on average.

Paper Structure (16 sections, 6 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: (a) Current VL models struggle with compositional aspects. In this case, both LLaVA and MiniGPT-4, which are extremely strong combinations of VL models with LLMs, misunderstand some basic spatial relations. (b) We are able to improve VL models' compositional reasoning with our proposed fine-tuning approach, which improves caption quality and increases caption density.
  • Figure 2: Detailed flow of our method: (i) The image is first captioned using the strong Captioner to create a high-quality caption. (ii) Then two methods for caption density enhancement (expansion) are applied: the "LLM-Expander" and the "SAM-Expander". Both create additional captions that relate to the image. (iii) The negative generator is applied to all captions, including the expanded ones. (iv) The image and all the captions are encoded using their respective encoders. (v) Finally, three losses are applied to the matching between the visual and textual embeddings.
  • Figure 3: (a) The distribution of the CLIP score as a proxy for caption quality. The improved captions generally have higher scores than the original captions. (b) Examples of images with their original and generated captions. The captioner (BLIP2) can also make compositional reasoning mistakes (e.g. the zebra image), but our DAC can compensate through the proposed caption density expansion, as is also evident from the quantitative advantage DAC has over BLIP2 in Tab. \ref{tab:main_res}. (c) Analysis of the VL-Checklist results with respect to the percentage of captions replaced with our higher-quality ones. We see a clear trend favoring quality captions. (d-e) The effects of improving caption density using the LLM and SAM methods, respectively. Increasing the percentage of sampled captions, and thus the density, yields significant improvements.
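
Steps (iv) and (v) of Figure 2 match one image embedding against a bag of caption embeddings (the captioner output plus the expanded captions). The exact losses are not given in this section, so the following is only a minimal sketch of how such bag-level matching could look, assuming a MIL-NCE-style objective over cosine similarities. The function name `mil_nce_loss`, the `temperature` value, and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def mil_nce_loss(image_emb, caption_bags, temperature=0.07):
    """Hypothetical MIL-NCE-style loss (an assumption, not the paper's code).

    image_emb:    (B, D) array, one embedding per image.
    caption_bags: (B, M, D) array, M caption embeddings per image.
    Captions in an image's own bag are treated jointly as positives;
    captions in all other bags of the batch act as negatives.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(caption_bags)
    # sim[i, j, m]: similarity of image i with caption m from bag j.
    sim = np.einsum("id,jmd->ijm", img, txt) / temperature
    batch = sim.shape[0]
    idx = np.arange(batch)
    # Log of the summed positive scores (own bag) for each image.
    pos = logsumexp(sim[idx, idx, :], axis=1)
    # Log of the summed scores over every caption in the batch.
    total = logsumexp(sim.reshape(batch, -1), axis=1)
    return float(np.mean(total - pos))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    # Toy bags: three noisy copies of each image embedding.
    bags = images[:, None, :] + 0.01 * rng.normal(size=(4, 3, 8))
    print("aligned bags:   ", mil_nce_loss(images, bags))
    print("mismatched bags:", mil_nce_loss(images, bags[::-1]))
```

Summing over the whole bag before taking the log means the model is not forced to match every generated caption. Under this assumption, a bag containing some noisy or partially wrong expansions can still contribute a useful signal as long as at least one caption aligns well with the image.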