EPLKG: Efficient Prompt Learning with Knowledge Graph

YongTaek Lim, Suho Kang, Yewon Kim, Dokyung Yoon, KyungWoo Song

TL;DR

EPLKG tackles the expensive adaptation of large multimodal models by grounding prompts in a knowledge graph and augmenting coverage with LLM-generated visual descriptions. It uses cached CLIP embeddings and a Gumbel-Softmax-based prompt selector to efficiently pick a single, human-interpretable prompt per image-class pair, avoiding backprop through the encoders. The approach yields substantial time and memory savings while maintaining competitive accuracy across 11 benchmarks and multiple generalization settings. Grad-CAM analyses corroborate improved interpretability, showing prompts that align with discriminative visual features. Overall, EPLKG offers a practical, scalable pathway for efficient, interpretable prompt learning in vision–language models.

Abstract

Large-scale pre-trained models such as CLIP excel in transferability and robust generalization across diverse datasets. However, adapting these models to new datasets or domains is computationally costly, especially in low-resource or few-shot settings, and existing prompt-learning methods often lack interpretability. We introduce Efficient Prompt Learning with Knowledge Graph (EPLKG), which uses a knowledge graph to curate diverse, interpretable prompts and, where KG coverage is limited, augments this bank with LLM-generated human-readable visual descriptions. EPLKG operates entirely on cached CLIP image and text embeddings and employs a lightweight Gumbel-Softmax module to select a single prompt per image-class pair, enabling low-memory, fast training. Across 11 benchmarks, EPLKG reduces per-image training time by up to 45 percent and peak GPU memory by around 30 to 40 percent compared to strong prompt-learning baselines, while keeping the average base-new harmonic-mean accuracy within 2 percentage points, thereby improving the efficiency of model adaptation without sacrificing competitive performance or interpretability.
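The base-new harmonic mean cited above can be computed as follows. This is a minimal sketch of the standard harmonic-mean formula, H = 2 · base · new / (base + new); the function name `harmonic_mean` and the accuracy values are illustrative assumptions, not figures reported in the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy (same unit in and out)."""
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + new_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * base_acc * new_acc / total


# Hypothetical accuracies in percent, chosen only to illustrate the arithmetic.
balanced = harmonic_mean(80.0, 70.0)
lopsided = harmonic_mean(95.0, 55.0)
print(f"{balanced:.2f} {lopsided:.2f}")
```

Both hypothetical pairs have an arithmetic mean of 75, yet the lopsided pair scores lower (69.67 versus 74.67), which is why the harmonic mean is a stricter summary when base and new accuracy diverge.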

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The motivation behind EPLKG is to leverage a knowledge graph to extract semantically relevant information for each class, forming triplets that capture plausible attributes of class labels and generating interpretable prompts from these structured relations. The resulting prompts improve both classification performance and interpretability over conventional zero-shot prompts.
  • Figure 2: Overview of EPLKG. For each class, KG triplets are converted into textual prompts and embedded with a frozen text encoder, while the input image is encoded with a frozen image encoder. EPLKG applies Gumbel-Softmax over CLIP image–text cosine similarities to select a compatible prompt for classification. Only the selection module is trained on cached embeddings, without backpropagation through the CLIP encoders, resulting in lower memory and compute cost and faster training.
  • Figure 3: The first subplot shows the average performance over 11 datasets; the others show EPLKG and baseline performance per dataset for $k \in \{1,2,4,8,16\}$ shots.
  • Figure 4: Each row presents Grad-CAM visualizations for zero-shot (ZS), CoOp, CoCoOp, KgCoOp and EPLKG (ours). Red regions indicate higher Grad-CAM relevance scores for the target class. In cases where the baseline models fail to make correct predictions, EPLKG succeeds by selecting an optimal prompt from the KG.
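The selection step described in the Figure 2 caption can be sketched roughly as follows. This is a forward-pass illustration only, not the authors' implementation: the function names (`cosine`, `gumbel_softmax`, `classify`), the 2-D toy vectors standing in for cached CLIP embeddings, and the example prompts are all assumptions. The trainable parameters of the selection module and its training loop are not specified in this summary and are omitted here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((x + g) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, prompt_bank, tau=1.0, seed=0):
    """Pick one prompt per class, then predict the class whose pick matches best.

    prompt_bank maps class name -> list of (prompt_text, cached_embedding).
    """
    rng = random.Random(seed)
    chosen = {}
    for cls, prompts in prompt_bank.items():
        sims = [cosine(image_emb, emb) for _, emb in prompts]
        probs = gumbel_softmax(sims, tau, rng)
        k = max(range(len(probs)), key=probs.__getitem__)  # hard (argmax) pick
        chosen[cls] = (prompts[k][0], sims[k])
    pred = max(chosen, key=lambda c: chosen[c][1])
    return pred, chosen


# Toy 2-D vectors standing in for cached CLIP embeddings (illustrative only).
bank = {
    "cat": [("a cat, which has whiskers", [0.9, 0.1]),
            ("a cat, which is a pet", [0.6, 0.4])],
    "dog": [("a dog, which has a snout", [0.1, 0.9])],
}
pred, chosen = classify([1.0, 0.0], bank, tau=0.5)
print(pred, chosen[pred][0])
```

Because every quantity is computed from precomputed embeddings, no encoder forward or backward pass is needed at this step, which is the source of the memory and time savings the caption describes. Lower values of `tau` make the relaxed sample closer to one-hot; the hard argmax here simply reads off the discrete choice.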