Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Filippos Gouidis, Katerina Papantoniou, Konstantinos Papoutsakis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis

TL;DR

This work addresses zero-shot Object State Classification (OSC) by incorporating domain-specific knowledge generated by Large Language Models (LLMs) into a Knowledge Graph (KG) framework and fusing it with pre-trained semantic embeddings. The authors propose a six-stage pipeline: prompting the LLM, constructing a KG from commonsense sources, producing semantic and visual embeddings, training a Graph Neural Network (GNN) to map the semantic space to the visual space, projecting the embeddings into the visual space, and adapting the classifier for zero-shot prediction. In extensive ablations and comparisons with state-of-the-art baselines, the approach shows substantial improvements and achieves state-of-the-art performance on four OSC datasets, which highlights the benefit of combining domain knowledge with general-purpose representations. The results suggest that LLM-derived domain knowledge can augment KG-based zero-shot vision tasks; promising directions include prompt optimization, LLM fine-tuning on image-text data, and application to other zero-shot problems.

Abstract

Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
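To make the idea of combining LLM-based embeddings with general-purpose pre-trained embeddings concrete, here is a minimal sketch. It assumes the two vectors for a class are fused by L2-normalizing each and concatenating them; that fusion rule, the function names, and the toy vectors are all illustrative assumptions, not details confirmed by the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def fuse_embeddings(llm_vec, general_vec):
    """Concatenate the two normalized views of a class into one feature vector.

    Concatenation is an assumption made for this sketch; the paper may
    combine the two embedding sources differently.
    """
    return l2_normalize(llm_vec) + l2_normalize(general_vec)


# Hypothetical toy vectors for the state class "open".
llm_open = [3.0, 4.0]            # stands in for an embedding of LLM-generated text
general_open = [0.0, 2.0, 0.0]   # stands in for a general-purpose pre-trained vector

fused_open = fuse_embeddings(llm_open, general_open)
print(fused_open)
```

Normalizing before concatenation keeps either source from dominating purely because of its scale, which is one common reason to prefer this form over a raw concatenation.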

Paper Structure (16 sections, 1 figure, 8 tables)


Figures (1)

  • Figure 1: The schematic representation of our methodology. A Large Language Model (LLM) is given specific prompts associated with the target classes we aim to identify. The resulting corpus is processed, leading to the generation of semantic embeddings. Subsequently, these vectors are fed into a Graph Neural Network (GNN), previously trained to map embeddings from the semantic space to the visual space. The resulting visual embeddings are then integrated into the final layer of a pre-trained Convolutional Neural Network (CNN) classifier.
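The last two steps in the Figure 1 caption can be sketched roughly as follows: a graph layer turns per-class semantic embeddings into visual-space vectors, and those vectors are used as the weights of the classifier's final layer. This is a simplified illustration under stated assumptions: a single linear graph-convolution step stands in for the trained GNN, and the adjacency matrix, weights, feature values, and function names are hypothetical toy values rather than anything taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adj_norm, node_feats, weight):
    """One graph-convolution step: aggregate neighbours, then apply a linear map."""
    return matmul(matmul(adj_norm, node_feats), weight)


def zero_shot_predict(image_feat, class_weights, class_names):
    """Score an image feature against each class vector and return the best class."""
    scores = [sum(f * w for f, w in zip(image_feat, row)) for row in class_weights]
    return class_names[scores.index(max(scores))]


# Hypothetical toy setup: two state classes, 2-d semantic and visual spaces.
class_names = ["open", "closed"]
semantic = [[1.0, 0.0], [0.0, 1.0]]        # one semantic embedding per class
adj_norm = [[0.75, 0.25], [0.25, 0.75]]    # row-normalized KG adjacency with self-loops
gnn_weight = [[1.0, 0.0], [0.0, 1.0]]      # stands in for trained GNN parameters

# Project class embeddings into the visual space; each row then plays the
# role of one final-layer weight vector of the classifier.
visual_weights = graph_conv(adj_norm, semantic, gnn_weight)

image_feat = [0.9, 0.1]  # stands in for the CNN's penultimate-layer features
prediction = zero_shot_predict(image_feat, visual_weights, class_names)
print(prediction)
```

Because the class vectors come from the graph rather than from labelled images, a class with no training images can still receive a weight vector, which is what makes the prediction zero-shot.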