ComAlign: Compositional Alignment in Vision-Language Models

Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah

TL;DR

ComAlign is a fine-grained approach that discovers more exact correspondences between text and image components using only weak supervision in the form of image-text pairs. It trains a lightweight network on top of existing visual and language encoders using a small dataset.

Abstract

Vision-language models (VLMs) like CLIP have shown a remarkable ability to extract transferable features for downstream tasks. Nonetheless, these models are usually trained with a coarse-grained contrastive loss between the global embeddings of images and texts, which may lose the compositional structure of these modalities. Many recent studies have shown that VLMs lack compositional understanding, such as attribute binding and identifying object relationships. Although some recent methods have tried to achieve finer-level alignment, they either are not based on extracting meaningful components of proper granularity or do not properly utilize the correspondence between the modalities (especially in image-text pairs with more ingredients). To address these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach that discovers more exact correspondences between text and image components using only weak supervision in the form of image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be retained in the image modality. To enforce correspondence of fine-grained concepts in the image and text modalities, we train a lightweight network on top of existing visual and language encoders using a small dataset. The network is trained to align the nodes and edges of the structure across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements in retrieval and compositional benchmarks, affirming the effectiveness of our plugin model.
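
For readers unfamiliar with the coarse-grained objective the abstract refers to, the following is a minimal sketch of a symmetric contrastive loss over global image and text embeddings. It is an illustration under assumptions, not the training code of CLIP or ComAlign: the function names and the temperature value of 0.07 are hypothetical choices made for this example, and pure Python lists stand in for real encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over one global embedding per image and text.

    ASSUMPTION: image_embs[i] and text_embs[i] form a matched pair, and the
    temperature is an illustrative value, not one taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Because each image and each text is reduced to a single vector before the loss is computed, nothing in this objective explicitly asks individual entities or relations to match across modalities, which is the gap the abstract describes.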

Paper Structure (18 sections, 11 equations, 7 figures, 6 tables, 1 algorithm)

This paper contains 18 sections, 11 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of how entities and their relationships are treated as components of the image and text. In the textual modality, entities are shown as nodes, and relationships are shown as edges (i.e., actions) together with their two corresponding nodes (i.e., subject and object). The visual modality mirrors this structure to allow better alignment of the modalities.
  • Figure 2: Overview of the proposed method. Given a batch of image-text pairs, each image and text is preprocessed by an object detector and NLP tools to extract entity and relational components. These components, along with the original image and text, are processed by a base VLM to obtain visual and textual representations. These are then passed through our ComAlign image and text encoders. We calculate the similarity score between an image and a text using three metrics: 1) Coarse-grained similarity: calculated as the dot product of the global features of the image and text. 2) Fine-grained entity-based similarities: the entity similarity matrix is obtained by calculating the cosine similarity between each pair of visual entity representation (VR) and textual entity representation (TR). 3) Fine-grained relation-based similarities: similarly, the relation similarity matrix is computed from the cosine similarity of all pairs of visual and textual relation representations. By applying Fine-Grained Matching to the obtained matrices, the overall entity-based and relation-based similarities between the image and text are obtained (for both Text2Image and Image2Text directions). These fine-grained similarities are used during contrastive training and inference.
  • Figure 3: Illustration of the process of calculating Image-to-Text (I2T) and Text-to-Image (T2I) similarity, including global, entity, and relational components.
  • Figure 4: Illustration of relational component similarity matrices. Left: CLIP-ViT-B/32, Right: ComAlign (Ours).
  • Figure 5: Illustration of entity component similarity matrices. Left: CLIP-ViT-B/32, Right: ComAlign (Ours).
  • ...and 2 more figures
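
The three similarity terms described in the Figure 2 caption can be sketched as follows. This is a rough illustration under assumptions, not the authors' implementation: the caption does not specify how Fine-Grained Matching reduces a similarity matrix to a score, so the sketch swaps in a simple max-then-mean rule (each component takes its best match in the other modality, and the best matches are averaged). All function names are hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def similarity_matrix(visual, textual):
    """Rows index visual components, columns index textual components."""
    return [[cosine(v, t) for t in textual] for v in visual]


def match_max_mean(sim):
    """Reduce a similarity matrix to (image-to-text, text-to-image) scores.

    ASSUMPTION: max-then-mean is a stand-in for the paper's Fine-Grained
    Matching step, whose exact form is not given in the caption.
    """
    i2t = sum(max(row) for row in sim) / len(sim)    # best text match per visual component
    cols = list(zip(*sim))
    t2i = sum(max(col) for col in cols) / len(cols)  # best visual match per textual component
    return i2t, t2i


def pair_similarities(img_global, txt_global, vis_ents, txt_ents, vis_rels, txt_rels):
    """Return the coarse-grained, entity-based, and relation-based similarities."""
    ent_i2t, ent_t2i = match_max_mean(similarity_matrix(vis_ents, txt_ents))
    rel_i2t, rel_t2i = match_max_mean(similarity_matrix(vis_rels, txt_rels))
    return {
        "global": dot(img_global, txt_global),  # dot product of global features
        "entity_i2t": ent_i2t,
        "entity_t2i": ent_t2i,
        "relation_i2t": rel_i2t,
        "relation_t2i": rel_t2i,
    }
```

The sketch returns the terms separately rather than combining them, since the caption does not state how the coarse-grained and fine-grained similarities are weighted in the final score.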