Knowledge Graph Guided Evaluation of Abstention Techniques

Kinshuk Vasisht, Navreet Kaur, Danish Pruthi

TL;DR

SELECT is a knowledge-graph-grounded benchmark for evaluating abstention techniques in language models; it uses benign concepts to isolate the effect of safety training. Benchmarking prompting, activation steering, and fine-tuning across six models, the paper finds high abstention rates on target concepts (over $80\%$) but weaker generalization to their descendants (a $19\%$ drop), along with clear generalization-specificity trade-offs and no single dominant technique. The findings help practitioners choose a strategy based on concept granularity and compute constraints, and point to robustness and deployment challenges, such as adversarial perturbations and multilingual contexts, as directions for future work. Overall, the study highlights the trade-offs between effectiveness, generalization, and specificity in abstention and provides a scalable framework for evaluating the techniques that underlie it.

Abstract

To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain with over $80\%$ abstention rates. However, these techniques are not as effective for descendants of the target concepts, where abstention rates drop by $19\%$. We also characterize the generalization-specificity trade-offs for different techniques. Overall, no single technique is invariably better than others, and our findings inform practitioners of the various trade-offs involved.

Paper Structure

This paper contains 36 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Leveraging knowledge graphs to evaluate abstention techniques. Ideally, abstaining from a concept should imply abstention for descendants (generalization) but not ancestor or sibling concepts (specificity).
  • Figure 2: Abstention rates at increasing distances from the target concept for LLaMA 3.1. For inference methods, abstention rates decrease at higher path distances.
  • Figure 3: Trends in evaluation metrics across different levels of the taxonomy for LLaMA-3.1 8B. In general, different abstention techniques follow similar trends, with abstention rates being better for lower levels (more specific concepts), while specificity decreases with increasing levels.
  • Figure 4: Phrases used for classifying responses.
  • Figure 5: Variations across evaluation metrics: abstention rate, generalization and specificity, for different models across taxonomy levels. Different abstention techniques show similar trends across metrics.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Definition 1: Abstention Rate
  • Definition 2: Generalization
  • Definition 3: Specificity
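
The three definitions are listed above by name only, so the following is a minimal sketch of how such metrics could be computed over a taxonomy, not the paper's formal definitions. It assumes the reading suggested by the Figure 1 caption: abstention rate is measured on questions about the target concept, generalization on questions about its descendants, and specificity as the share of questions about ancestors and siblings that are still answered. The toy taxonomy, the `ABSTAINED` flags, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); not taken from the benchmark.
PARENT = {
    "water body": "geographic feature",
    "mountain": "geographic feature",
    "river": "water body",
    "lake": "water body",
    "tributary": "river",
}


def children_of(parent_map):
    """Invert the child -> parent map into parent -> set of children."""
    kids = defaultdict(set)
    for child, parent in parent_map.items():
        kids[parent].add(child)
    return kids


def descendants(concept, parent_map):
    """All concepts strictly below `concept` in the taxonomy."""
    kids = children_of(parent_map)
    found, stack = set(), [concept]
    while stack:
        for child in kids[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def ancestors(concept, parent_map):
    """All concepts strictly above `concept` in the taxonomy."""
    found = set()
    while concept in parent_map:
        concept = parent_map[concept]
        found.add(concept)
    return found


def siblings(concept, parent_map):
    """Concepts that share a parent with `concept`."""
    parent = parent_map.get(concept)
    if parent is None:
        return set()
    return children_of(parent_map)[parent] - {concept}


def rate(concepts, abstained):
    """Fraction of questions about `concepts` on which the model abstained."""
    flags = [flag for c in concepts for flag in abstained.get(c, [])]
    return sum(flags) / len(flags) if flags else float("nan")


def evaluate(target, parent_map, abstained):
    """Assumed metric definitions; see the hedges in the lead-in."""
    related = ancestors(target, parent_map) | siblings(target, parent_map)
    return {
        "abstention_rate": rate({target}, abstained),
        "generalization": rate(descendants(target, parent_map), abstained),
        # Specificity rewards answering (not abstaining) on related concepts.
        "specificity": 1.0 - rate(related, abstained),
    }


# Hypothetical per-question outcomes: True means the model abstained.
ABSTAINED = {
    "river": [True, True, True, False],
    "tributary": [True, False],
    "lake": [False, False, True, False],
    "water body": [False, False],
    "geographic feature": [False, False],
}

metrics = evaluate("river", PARENT, ABSTAINED)
print(metrics)
```

With these made-up flags, the model abstains on most questions about the target "river", on only half of those about its descendant "tributary", and wrongly on one question about the sibling "lake". This is the same qualitative pattern the abstract reports: strong abstention on the target, weaker generalization to descendants, and a cost in specificity.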