LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Jingyi Wang, Jianzhong Ju, Jian Luan, Zhidong Deng

TL;DR

LLaVA-SG adds a Scene Graph Expression (SGE) module to large vision-language models (LVLMs) to counter the fragmented perception caused by splitting images into ViT patches. The module extracts and structurally expresses the complex semantic information within images, improving the models' foundational perception and understanding abilities.

Abstract

Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.
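
To make the contrast in the abstract concrete, the following minimal sketch compares the two views of an image: a flat grid of ViT-style patches versus a scene graph of (subject, predicate, object) relations. It is an illustration only, not the paper's SGE module. The names `patch_count`, `Relation`, and `graph_nodes`, the 336-pixel input with 14-pixel patches, and the example relations are all assumptions chosen for the example.

```python
from dataclasses import dataclass


def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches for a square image.

    Hypothetical helper: mirrors how a ViT-style encoder tiles its input.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


@dataclass(frozen=True)
class Relation:
    """One scene-graph edge: subject --predicate--> obj (illustrative type)."""
    subject: str
    predicate: str
    obj: str


def graph_nodes(relations: list[Relation]) -> list[str]:
    """Collect the distinct objects in a scene graph, in first-seen order."""
    nodes: list[str] = []
    for rel in relations:
        for name in (rel.subject, rel.obj):
            if name not in nodes:
                nodes.append(name)
    return nodes


# Patch view: the image becomes a grid of tokens with no object boundaries.
num_patches = patch_count(image_size=336, patch_size=14)

# Scene-graph view: the same image summarized as objects and their relations.
# These relations are made up for illustration.
scene_graph = [
    Relation("person", "riding", "horse"),
    Relation("horse", "standing on", "grass"),
]
nodes = graph_nodes(scene_graph)
```

Under these assumed sizes the patch view yields hundreds of tokens, none of which corresponds to a whole object, while the scene graph names three objects and two relations explicitly. That kind of structured semantic information is what the abstract says the SGE module is meant to supply.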

Paper Structure (12 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of the difference between (a) the baseline method, Large Language and Vision Assistant (LLaVA) [liu2024llava], and (b) our LLaVA-SG model. As a complement to the baseline method of dividing images into patches, our LLaVA-SG leverages scene graphs as the expression of visual semantics within images.
  • Figure 2: The structure of the proposed Scene Graph Expression module and the LLaVA-SG framework.
  • Figure 3: Example outputs of LLaVA-1.5 and our LLaVA-SG model with the first case from MMBench and the second case from POPE.