Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification
Filippos Gouidis, Katerina Papantoniou, Konstantinos Papoutsakis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis
TL;DR
This work addresses zero-shot Object State Classification (OSC) by incorporating domain-specific knowledge generated by Large Language Models (LLMs) into a Knowledge Graph (KG) framework and fusing it with pre-trained semantic embeddings. The authors propose a six-stage pipeline: (1) prompting the LLM, (2) constructing a KG from commonsense sources, (3) producing semantic and visual embeddings, (4) training a Graph Neural Network (GNN) to map the semantic space to the visual space, (5) projecting the embeddings into the visual space, and (6) adapting the classifier for zero-shot prediction. In extensive ablations and comparisons with state-of-the-art baselines, the approach shows significant improvements and achieves state-of-the-art performance on four OSC datasets, which highlights the benefit of combining domain knowledge with general-purpose representations. The results suggest that LLM-generated domain knowledge can robustly augment KG-based zero-shot vision tasks, and they point to prompt optimization, LLM fine-tuning on image-text data, and application to other zero-shot problems as promising directions.
Abstract
Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
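To make the later pipeline stages summarized in the TL;DR more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the two embedding sources are fused by concatenation, that the GNN is replaced by a single linear graph-convolution layer fitted by least squares, and that classification uses cosine similarity. All function names and dimensions are hypothetical, and random vectors stand in for real LLM-based and pre-trained embeddings.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as used by GCN-style layers."""
    a = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def fuse(general, llm):
    """Fuse two embedding sources per KG node (concatenation is an assumption)."""
    return np.concatenate([general, llm], axis=1)


def fit_linear_gcn(a_hat, x, seen_idx, target_weights):
    """Fit Theta so that (A_hat @ X @ Theta)[seen] approximates the visual
    classifier weights that are known for the seen classes."""
    h = a_hat @ x                            # propagate features over the graph
    theta, *_ = np.linalg.lstsq(h[seen_idx], target_weights, rcond=None)
    return theta


def predict_classifiers(a_hat, x, theta):
    """Project every node, including unseen classes, into the visual space."""
    return a_hat @ x @ theta


def classify(feature, classifiers, candidate_idx):
    """Return the candidate class whose projected classifier is most similar
    (cosine similarity) to the visual feature."""
    c = classifiers[candidate_idx]
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    f = feature / np.linalg.norm(feature)
    return candidate_idx[int(np.argmax(c @ f))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_nodes, d_general, d_llm, d_visual = 6, 8, 4, 5

    # Toy ring-shaped KG over six state classes (hypothetical structure).
    adj = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        adj[i, (i + 1) % n_nodes] = adj[(i + 1) % n_nodes, i] = 1.0

    x = fuse(rng.normal(size=(n_nodes, d_general)),
             rng.normal(size=(n_nodes, d_llm)))
    a_hat = normalize_adjacency(adj)

    seen, unseen = [0, 1, 2, 3], [4, 5]
    seen_weights = rng.normal(size=(len(seen), d_visual))  # stand-in targets

    theta = fit_linear_gcn(a_hat, x, seen, seen_weights)
    classifiers = predict_classifiers(a_hat, x, theta)

    image_feature = rng.normal(size=d_visual)  # stand-in for a CNN feature
    print("predicted unseen class:", classify(image_feature, classifiers, unseen))
```

The least-squares fit stands in for GNN training only to keep the sketch short. A trained multi-layer GNN would replace `fit_linear_gcn` and `predict_classifiers`, while the zero-shot step of scoring image features against projected class vectors would stay the same.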
