LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Jingyi Wang; Jianzhong Ju; Jian Luan; Zhidong Deng

TL;DR

This paper introduces a Scene Graph Expression (SGE) module into large vision-language models (VLMs) to counter the fragmented perception caused by ViT patch division. The module extracts and structurally expresses the complex semantic information within images, improving the models' foundational perception and understanding abilities.

Abstract

Recent large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. Because ViT divides each image into patches, its perception of the image is fragmented, which hinders the visual understanding capabilities of VLMs. In this paper, we address this limitation by introducing a Scene Graph Expression (SGE) module into VLMs. The module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances VLM performance on vision-language tasks, indicating that it preserves intricate semantic details and facilitates better visual understanding.
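To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's SGE module. It assumes a scene graph can be held as a list of entities with bounding boxes plus (subject, predicate, object) relations, and it counts how many ViT-style patches a single entity's box is spread across. The class names, the helper `patches_touched`, the patch size of 14, and the example boxes are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A visual entity: a label plus a pixel box (x0, y0, x1, y1), x1/y1 exclusive."""
    name: str
    box: tuple


@dataclass
class SceneGraph:
    """Hypothetical container: entities are nodes, relations are labeled edges."""
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)

    def add_entity(self, name, box):
        self.entities.append(Entity(name, box))
        return len(self.entities) - 1  # index used to refer to this node

    def add_relation(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

    def triples(self):
        """Return relations as readable (subject, predicate, object) name triples."""
        return [
            (self.entities[s].name, p, self.entities[o].name)
            for s, p, o in self.relations
        ]


def patches_touched(box, patch_size):
    """Count grid patches of side `patch_size` that overlap the given box."""
    x0, y0, x1, y1 = box
    cols = (x1 - 1) // patch_size - x0 // patch_size + 1
    rows = (y1 - 1) // patch_size - y0 // patch_size + 1
    return cols * rows


# Illustrative example (made-up boxes): one entity is split across many patches,
# while the scene graph keeps it as a single node with explicit relations.
graph = SceneGraph()
dog = graph.add_entity("dog", (30, 40, 100, 120))
sofa = graph.add_entity("sofa", (0, 60, 200, 180))
graph.add_relation(dog, "sitting on", sofa)

dog_patches = patches_touched(graph.entities[dog].box, patch_size=14)
```

In this toy setup the dog's box overlaps dozens of patches, yet the graph represents it as one node linked to the sofa by a single labeled edge, which is the kind of entity-level structure the abstract argues patch tokens alone do not preserve.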

Paper Structure (12 sections, 1 equation, 3 figures, 3 tables)

Introduction
Method
Semantic Information Expression
Visual Entity Extraction
Scene Graph Expression
Scene Graph Expression in VLM
Training
Experiments
Experimental Setup
Overall Performance Assessments
Ablation Study
Conclusion

Figures (3)

Figure 1: Illustration of the difference between (a) the baseline method, Large Language and Vision Assistant (LLaVA) [liu2024llava], and (b) our LLaVA-SG model. As a complement to the baseline approach of dividing images into patches, LLaVA-SG leverages scene graphs to express the visual semantics within images.
Figure 2: The structure of the proposed Scene Graph Expression module and the LLaVA-SG framework.
Figure 3: Example outputs of LLaVA-1.5 and our LLaVA-SG model with the first case from MMBench and the second case from POPE.
