SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

TL;DR

SLIP addresses a gap in vision-language pretraining: existing methods align image-text pairs in isolation and ignore the relational structure among them. It extends CLIP with modality-specific Graph Attention Networks and a structural contrastive loss that treats graph-connected nodes as positives, enabling relational supervision at scale. The authors release the Multimodal Amazon Product Co-purchase Graph Dataset and show that SLIP improves on CLIP in cross-modal retrieval and classification, especially in large-batch training and few-shot settings. The work highlights the importance of relational context and proposes temporal graph refinement as a future direction for capturing evolving relationships. Overall, SLIP offers a scalable, structure-aware paradigm for more coherent and contextually grounded vision-language representations.

Abstract

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Existing methods treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.
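
To make the idea of a structural contrastive loss concrete, the following is a minimal sketch of one plausible form: a symmetric, multi-positive InfoNCE objective in which an image's positives are its own text plus the texts of its graph neighbors. This section does not give the exact formulation used in SLIP, so the function names, the uniform averaging over positives, the default temperature of 0.07, and the use of plain Python lists are all assumptions made for illustration. The Graph Attention Network encoders are omitted; embeddings are taken as given and assumed non-zero.

```python
import math


def _normalize(v):
    # L2-normalize an embedding (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _multi_positive_nce(logits, positives):
    # For each anchor, average the negative log-probability over all positives.
    total = 0.0
    for row, pos_row in zip(logits, positives):
        log_probs = _log_softmax(row)
        pos = [lp for lp, is_pos in zip(log_probs, pos_row) if is_pos]
        total += -sum(pos) / len(pos)
    return total / len(logits)


def structural_contrastive_loss(image_emb, text_emb, adjacency, temperature=0.07):
    """Hypothetical structure-aware contrastive loss (illustrative only).

    image_emb, text_emb: lists of n embeddings; entry i is the same item.
    adjacency: n x n matrix, truthy where items i and j are graph neighbors.
    """
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    n = len(img)
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Positives: the matched pair (diagonal) plus graph-connected items.
    positives = [[i == j or bool(adjacency[i][j]) for j in range(n)] for i in range(n)]
    # Text-to-image direction uses the transposed logits and mask.
    logits_t = [list(col) for col in zip(*logits)]
    positives_t = [list(col) for col in zip(*positives)]
    image_to_text = _multi_positive_nce(logits, positives)
    text_to_image = _multi_positive_nce(logits_t, positives_t)
    return 0.5 * (image_to_text + text_to_image)
```

With an all-zero adjacency matrix the mask reduces to the diagonal, so this sketch falls back to the standard CLIP-style symmetric contrastive loss; adding edges only changes which pairs count as positives.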
