Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models

Yikai Zhang, Qianyu He, Xintao Wang, Siyu Yuan, Jiaqing Liang, Yanghua Xiao

TL;DR

This work tackles the grounding of long-tailed entities in Multi-Modal Knowledge Graphs (MMKGs) by introducing COG, a two-stage framework that guides vision-language models with concepts. COG is trained with contrastive learning at both the entity and concept levels. At inference, its Concept Integration module predicts image-text matches, and its Evidence Fusion module provides interpretable evidence to support human verification; together they improve grounding accuracy across multiple pre-trained vision-language model (PVLM) backbones. A dataset of 25k image-text pairs of long-tailed entities is used to demonstrate the method's effectiveness and explainability, with gains over baseline grounding approaches. The approach is model-pluggable and targets quality control in MMKG construction and downstream tasks.

Abstract

Multi-Modal Knowledge Graphs (MMKGs) have proven valuable for various downstream tasks. However, scaling them up is challenging because building large-scale MMKGs often introduces mismatched images (i.e., noise). Most entities in KGs belong to the long tail, meaning there are few images of them available online. This scarcity makes it difficult to determine whether a found image matches the entity. To address this, we draw on the Triangle of Reference Theory and suggest enhancing vision-language models with concept guidance. Specifically, we introduce COG, a two-stage framework with COncept-Guided vision-language models. The framework comprises a Concept Integration module, which effectively identifies image-text pairs of long-tailed entities, and an Evidence Fusion module, which offers explainability and enables human verification. To demonstrate the effectiveness of COG, we create a dataset of 25k image-text pairs of long-tailed entities. Our comprehensive experiments show that COG not only improves the accuracy of recognizing long-tailed image-text pairs compared to baselines but also offers flexibility and explainability.

Paper Structure (35 sections, 6 equations, 7 figures, 6 tables)

This paper contains 35 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: We randomly select 100 entities from the large-scale knowledge graph CN-DBpedia (Xu et al., 2017) and annotate them manually. The blue line shows each entity's view times, which indicate its click frequency. The red dots indicate the number of correctly matched images found in the top 20 search results for each entity, and the red line smooths these data points.
  • Figure 2: This figure shows that when searching for an entity named Aristoxenus, the search engine returns two images. By applying concepts, we can conclude that the target Aristoxenus refers to a person, not a butterfly.
  • Figure 3: Overview of COG. COG uses contrastive learning on entity and concept levels for model training. At the inference stage, we utilize a two-stage framework with Concept Integration and Evidence Fusion modules. Concept Integration aims for direct prediction of image-text matches using concept guidance, while the Evidence Fusion module reassesses discarded image candidates from Concept Integration, particularly valuable for rare, long-tailed entities.
  • Figure 4: The process of obtaining correct images through short text entity linking.
  • Figure 5: Comparison of using different concepts in our framework. Not Using Concepts means using only entity names. Using BLC Concepts and Using All Concepts mean using BLC concepts and all concepts, respectively.
  • ...and 2 more figures
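
The two-stage inference flow described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `Candidate` fields, the linear blend with weight `alpha`, the `accept` and `rescue` thresholds, and the `evidence_score` callable are all hypothetical names and choices introduced here. The only structure taken from the source is that Concept Integration makes a direct match prediction using concept guidance, and Evidence Fusion reassesses the candidates that Concept Integration discarded.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    image_id: str
    entity_score: float   # hypothetical image-vs-entity-name similarity in [0, 1]
    concept_score: float  # hypothetical image-vs-concept similarity in [0, 1]


def concept_integration(c: Candidate, alpha: float = 0.5) -> float:
    """Stage 1 (assumed form): blend entity-level and concept-level scores."""
    return alpha * c.entity_score + (1.0 - alpha) * c.concept_score


def two_stage_grounding(
    candidates: List[Candidate],
    evidence_score: Callable[[Candidate], float],
    accept: float = 0.6,   # hypothetical stage-1 acceptance threshold
    rescue: float = 0.7,   # hypothetical stage-2 acceptance threshold
) -> Tuple[List[str], List[str], List[str]]:
    """Return (accepted in stage 1, rescued in stage 2, rejected) image ids."""
    accepted: List[str] = []
    rescued: List[str] = []
    rejected: List[str] = []
    for c in candidates:
        if concept_integration(c) >= accept:
            # Stage 1: direct prediction with concept guidance.
            accepted.append(c.image_id)
        elif evidence_score(c) >= rescue:
            # Stage 2: reassess a candidate that stage 1 discarded.
            rescued.append(c.image_id)
        else:
            rejected.append(c.image_id)
    return accepted, rescued, rejected
```

In this sketch, `evidence_score` stands in for whatever evidence the second stage gathers; keeping the rescued list separate from the stage-1 accepts makes it easy to route those images to human verification, which is the role the abstract assigns to Evidence Fusion.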