Knowledge Graph Guided Evaluation of Abstention Techniques
Kinshuk Vasisht, Navreet Kaur, Danish Pruthi
TL;DR
The paper introduces SELECT, a knowledge-graph-grounded benchmark for evaluating abstention techniques in language models; it uses benign concepts to isolate the effect of safety training. Benchmarking prompting, activation steering, and fine-tuning across six models, the authors find abstention rates above 80% on target concepts, but rates drop by 19% for descendants of those concepts. Each technique trades generalization against specificity, and no single technique dominates. The findings help practitioners choose a strategy based on concept granularity and compute constraints, and point to robustness under adversarial perturbations and multilingual deployment as directions for future work.
Abstract
To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain with over $80\%$ abstention rates. However, these techniques are not as effective for descendants of the target concepts, where abstention rates drop by $19\%$. We also characterize the generalization-specificity trade-offs for different techniques. Overall, no single technique is invariably better than others, and our findings inform practitioners of the various trade-offs involved.
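To make the three quantities in the abstract concrete, the following is a minimal sketch of how abstention rates over a target concept, its descendants, and unrelated concepts might be computed. It is not the SELECT implementation: the toy taxonomy, the abstention flags, the function names, and the definition of specificity as one minus the abstention rate on concepts outside the target's subtree are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct children.
TAXONOMY = {
    "geography": ["rivers", "mountains"],
    "rivers": ["Nile", "Amazon"],
    "mountains": ["Everest"],
}

def descendants(graph, concept):
    """Return every concept below `concept`, in breadth-first order."""
    seen, queue = [], deque(graph.get(concept, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(graph.get(node, []))
    return seen

def abstention_rate(abstained, concepts):
    """Fraction of queries about `concepts` on which the model abstained."""
    flags = [flag for c in concepts for flag in abstained.get(c, [])]
    return sum(flags) / len(flags) if flags else 0.0

def evaluate(graph, abstained, target):
    """Score one abstention technique for one target concept.

    Assumption: every queried concept outside the target's subtree
    counts as unrelated, and the model should still answer about it.
    """
    below = descendants(graph, target)
    related = {target, *below}
    unrelated = [c for c in abstained if c not in related]
    return {
        "effectiveness": abstention_rate(abstained, [target]),
        "generalization": abstention_rate(abstained, below),
        "specificity": 1.0 - abstention_rate(abstained, unrelated),
    }

# Made-up abstention flags (True = the model abstained), one per query.
ABSTAINED = {
    "rivers": [True, True, True, True],
    "Nile": [True, False],
    "Amazon": [True, False],
    "mountains": [False, False, False, True],
}

scores = evaluate(TAXONOMY, ABSTAINED, "rivers")
print(scores)
```

In this toy setting the technique abstains on every query about the target but on only half of the queries about its descendants, which mirrors the kind of generalization gap the abstract reports. The single abstention on "mountains" lowers specificity.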
