Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang

TL;DR

The paper tackles the challenge that vision-language models often rely on generic representations and fail to capture fine-grained structure about objects, attributes, and relations. It introduces Structure-CLIP, which combines Scene Graph Knowledge-guided semantic negative sampling with a Knowledge-Enhanced Encoder to inject structured knowledge into multi-modal learning. Empirical results demonstrate state-of-the-art performance on VG-Relation and VG-Attribution while preserving strong general representations on MSCOCO, and ablations confirm the value of semantic negatives and the KEE integration. The work offers a practical path to more semantically precise image-text understanding with potential extensions to knowledge graphs and generation tasks.

Abstract

Large-scale vision-language pre-training has achieved strong performance on multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. As illustrated in Fig. 1(a), the models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present Structure-CLIP, an end-to-end framework that integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations. Moreover, a Knowledge-Enhanced Encoder (KEE) is proposed that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining its general representation ability. Our code is available at https://github.com/zjukg/Structure-CLIP.
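
The idea of scene-graph-guided semantic negatives can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Triple` class and the helpers `swap_terms` and `relation_negative` are hypothetical names, the scene graph is assumed to be given by hand rather than produced by a parser, and article agreement ("a" vs. "an") is ignored. The sketch only shows how exchanging the two ends of a scene-graph pair yields a caption with the same words but a different structure.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) edge of a caption's scene graph (hypothetical type)."""
    subject: str
    relation: str
    obj: str


def swap_terms(caption: str, a: str, b: str) -> Optional[str]:
    """Exchange every whole-word occurrence of `a` and `b` in `caption`.

    Returns None when the terms are identical or either one is missing,
    because no meaningful negative can be built in that case.
    """
    if a == b:
        return None
    for term in (a, b):
        if not re.search(r"\b" + re.escape(term) + r"\b", caption):
            return None
    # Longer term first so that a multi-word phrase wins over its own prefix.
    longer, shorter = sorted((a, b), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + re.escape(longer) + "|" + re.escape(shorter) + r")\b"
    )
    # A single substitution pass avoids swapping a term back after replacing it.
    return pattern.sub(lambda m: b if m.group(0) == a else a, caption)


def relation_negative(caption: str, triple: Triple) -> Optional[str]:
    """Build a negative caption by exchanging the subject and object of a triple."""
    return swap_terms(caption, triple.subject, triple.obj)


# Relation-level negative: same words, reversed roles.
triple = Triple("astronaut", "rides", "horse")
print(relation_negative("the astronaut rides the horse", triple))
# prints "the horse rides the astronaut"

# Attribute-level negative: exchange the attributes of two different objects.
print(swap_terms("a black cat on a white table", "black", "white"))
# prints "a white cat on a black table"
```

Because the swapped terms come from the scene graph, the exchange always targets objects, attributes, or relation arguments, rather than arbitrary words whose exchange may leave the meaning unchanged.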

Paper Structure (38 sections, 14 equations, 5 figures, 5 tables)

This paper contains 38 sections, 14 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: CLIP scores (normalized across the two results) between the image and aligned/unaligned captions. The results show that the CLIP model cannot distinguish sentences with structured semantic differences.
  • Figure 2: Overview of Structure-CLIP. (a) Semantic negative sampling via scene graph: we extract a scene graph from the caption to help construct high-quality negative samples (left part). (b) Knowledge-Enhanced Encoder: a knowledge embedding module and multiple Transformer layers are used to model structured knowledge at the input level (right part).
  • Figure 3: Predictions of different approaches. The words in red and blue are the two exchanged words. We compare our Structure-CLIP with CLIP by calculating CLIP scores (i.e., semantic similarity) between the image and captions.
  • Figure 4: Our method compared to NegCLIP in a negative sampling scenario. (a) Negative sampling in NegCLIP. We show two situations: one in which two attributes describing the same object are exchanged, and another in which two less important words (e.g., prepositions, conjunctions) are exchanged. (b) Negative sampling via scene graph generation. From the caption, we obtain a scene graph from which we can capture the important pair suitable for the exchange.
  • Figure 5: Ablation study of different negative samplings. Our hard negative sampling is more effective than others.
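
The captions of Figures 1 and 3 refer to CLIP scores normalized across an aligned and an unaligned caption. The page does not state the exact normalization, so the following Python sketch shows two plausible readings under that assumption: a proportional normalization and a two-way softmax. The function names, the `temperature` parameter, and the input values are illustrative and not taken from the paper.

```python
import math
from typing import Tuple


def normalize_pair(s_aligned: float, s_unaligned: float) -> Tuple[float, float]:
    """Proportional normalization: each score divided by the sum of both."""
    total = s_aligned + s_unaligned
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return s_aligned / total, s_unaligned / total


def softmax_pair(
    s_aligned: float, s_unaligned: float, temperature: float = 1.0
) -> Tuple[float, float]:
    """Two-way softmax over the scores; `temperature` is a hypothetical knob."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(s_aligned, s_unaligned)
    e_a = math.exp((s_aligned - m) / temperature)
    e_u = math.exp((s_unaligned - m) / temperature)
    return e_a / (e_a + e_u), e_u / (e_a + e_u)


# Made-up similarity values, used only to show the behaviour.
aligned, unaligned = normalize_pair(0.31, 0.29)
print(f"proportional: {aligned:.3f} vs {unaligned:.3f}")

aligned, unaligned = softmax_pair(0.31, 0.29, temperature=0.01)
print(f"softmax:      {aligned:.3f} vs {unaligned:.3f}")
```

Under either reading, a model that cannot tell the two captions apart produces a pair close to 0.5 and 0.5, which is the failure mode Figure 1 describes for CLIP.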