SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

TL;DR

SLIP addresses the gap in vision-language pretraining by integrating relational structure into cross-modal alignment. It extends CLIP with modality-specific Graph Attention Networks and a structural contrastive loss that treats graph-connected nodes as positives, enabling relational supervision at scale. The authors release the Multimodal Amazon Product Co-purchase Graph Dataset and demonstrate that SLIP improves cross-modal retrieval and classification, especially in large-batch training and few-shot settings, compared to CLIP. The work highlights the importance of relational context and proposes future directions toward temporal graph refinement to capture evolving relationships. Overall, SLIP offers a scalable, structure-aware paradigm for more coherent and contextually grounded vision-language representations.

Abstract

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Existing methods, however, treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.
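To make the idea of a structural contrastive loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. It contrasts a CLIP-style InfoNCE loss, where only the matching image-text pair is a positive, with a multi-positive loss in which graph-connected nodes also count as positives. Several details are assumptions made for illustration: the function names, the temperature value, averaging over each node's positives, and the `weight` parameter used to combine the two terms. The sketch also takes node embeddings as given rather than producing them with graph attention layers.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis=1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: only the diagonal (true image-text pairs) is positive."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def structural_contrastive(node_emb, adjacency, temperature=0.07):
    """Multi-positive contrastive loss: graph neighbors are treated as positives.

    Averaging over each node's positives is an assumption of this sketch.
    """
    n = len(node_emb)
    # Positives are graph neighbors; a node is never its own positive.
    pos = np.asarray(adjacency).astype(bool) & ~np.eye(n, dtype=bool)
    pos_counts = pos.sum(axis=1)
    has_pos = pos_counts > 0
    if not has_pos.any():
        return 0.0  # no edges inside this batch, so nothing to contrast
    z = l2_normalize(node_emb)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # drop self-similarity from the softmax
    log_probs = log_softmax(logits, axis=1)
    # np.where avoids 0 * -inf = nan on the masked diagonal.
    summed = -np.where(pos, log_probs, 0.0).sum(axis=1)
    return float((summed[has_pos] / pos_counts[has_pos]).mean())

def slip_loss(image_emb, text_emb, node_emb, adjacency, weight=1.0):
    """Hypothetical combination: InfoNCE plus a weighted structural term."""
    return info_nce(image_emb, text_emb) + weight * structural_contrastive(
        node_emb, adjacency
    )
```

In this sketch, a batch whose sampled sub-graph has no edges reduces to plain InfoNCE, which matches the intuition that the structural term only adds supervision where relational information is available.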

Paper Structure

This paper contains 30 sections, 16 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of contrastive vision-language pretraining using InfoNCE loss. An Amazon product (e.g., a laptop) is paired with its associated textual description to form positive image-text pairs. The model is trained to align these pairs while pushing apart negative pairs sampled from the batch. The image and text are independently encoded via modality-specific encoders before similarity computation.
  • Figure 2: SLIP training pipeline. A mini-batch of products is first sampled as a sub-graph from the Amazon co-purchase network (bottom left). Images undergo standard data augmentation, while titles/descriptions are tokenized. Both modalities are encoded by a CLIP backbone. Top path: the image–text token similarities form the usual InfoNCE matrix, in which the diagonal (green) contains true pairs and off-diagonal cells (grey) act as negatives. Bottom path: the sampled sub-graph is converted to an $n$-hop adjacency mask that selects nodes within one hop as additional positives (purple) and masks the rest (light grey). Image and text features are concatenated, passed through two layers of graph attention, and projected to node embeddings. Applying the mask to their similarity matrix yields the structural contrastive loss.
  • Figure 3: A 10-node subgraph sampled from the curated dataset (Electronics)
  • Figure 4: Qualitative retrieval comparison for the query title "Garmin Delorme Atlas & Gazetteer Paper Maps - Alaska, AA-000004-000". The top row shows the top ten image results from CLIP fine-tuned without graph supervision (w/o graph), and the bottom row shows the corresponding results from SLIP with graph supervision (w/ graph). True matches are highlighted with a colored border and annotated with their retrieval rank.
  • Figure 5: Cosine similarity distributions between cross-modal embeddings at different graph hop distances. We show the density estimates of cosine similarity between image and text embeddings, grouped by hop distance in the product co-purchase graph: 0-hop (self), 1-hop (direct neighbor), 2-hop, and 3-hop.
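The hop-based quantities in the captions for Figures 2 and 5 can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the paper's code: edges are treated as undirected, the function names are hypothetical, and the per-hop summary reports mean cosine similarity rather than the density estimates shown in Figure 5.

```python
from collections import deque

import numpy as np

def hop_distances(edges, num_nodes, max_hops):
    """All-pairs hop distance via BFS, truncated at max_hops.

    Pairs farther apart than max_hops (or disconnected) are marked -1.
    Edges are treated as undirected, which is an assumption of this sketch.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    dist = np.full((num_nodes, num_nodes), -1, dtype=int)
    for src in range(num_nodes):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[src, u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in neighbors[u]:
                if dist[src, v] == -1:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist

def positive_mask(dist, n_hops=1):
    """Boolean mask of additional positives: nodes within n_hops, excluding self."""
    return (dist >= 1) & (dist <= n_hops)

def mean_similarity_by_hop(image_emb, text_emb, dist, max_hops):
    """Mean image-text cosine similarity grouped by hop distance (0 = same product)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    return {
        h: float(sims[dist == h].mean())
        for h in range(max_hops + 1)
        if (dist == h).any()
    }
```

For a four-node path graph with edges `[(0, 1), (1, 2), (2, 3)]`, `positive_mask(hop_distances(edges, 4, 3), n_hops=1)` marks only adjacent nodes as positives, while raising `n_hops` widens the set of positives to more distant neighbors.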