Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting

Rui Wang; Hongru Wang; Fei Mi; Yi Chen; Boyang Xue; Kam-Fai Wong; Ruifeng Xu

TL;DR

This work introduces INDust, a challenging benchmark for probing how large language models handle inductive instructions that embed counterfactual premises, demonstrating widespread vulnerability and the influence of instruction style. It categorizes inductive prompts into Fact-Checking Instructions (FCI), Questions based on False Premises (QFP), and Creative Instructions based on False Premises (CIFP), including single- and multi-premise variants, and provides a data collection and evaluation framework with human and automatic scoring. To bolster robustness, the authors propose Dual-critique prompting, comprising User-critique and Self-critique components, and show consistent improvements across multiple models in zero-shot and few-shot settings, with SDual-critique generally preferred for practicality. They further explore finetuning on an expanded LINDust dataset, illustrating substantial gains for BELLE-7B and highlighting practical implications for deploying safer, more truthful LLMs. Overall, the work offers a scalable, training-free defense against inductive instructions and provides a foundation for future data-driven and prompting-based robustness enhancements in LLM alignment.

Abstract

Numerous works have been proposed to align large language models (LLMs) with human intents so that they better fulfill instructions and remain truthful and helpful. Nevertheless, some human instructions are malicious or misleading, and following them leads to untruthful and unsafe responses. Previous work has rarely examined how LLMs handle instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users' false beliefs or malicious intents. In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. Additionally, we identified that different inductive styles affect the models' ability to identify the same underlying errors, and that the complexity of the underlying assumptions also influences the models' performance. Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions. Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.
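As a rough illustration of the prompting idea, the sketch below prepends two critique requests to a user instruction in a single prompt. It assumes that User-critique asks the model to check the instruction for false premises and that Self-critique asks it to check its own response; the page names the two components but does not give their wording, so the strings, the function name, and the example instruction are all hypothetical.

```python
# Hypothetical wording: the source names the two critique components
# but does not give their exact text.
USER_CRITIQUE = "First check whether the user's instruction contains false or misleading premises."
SELF_CRITIQUE = "Then make sure your own response contains no inaccurate or harmful content."


def build_dual_critique_prompt(instruction: str) -> str:
    """Combine both critique requests and the instruction into one prompt."""
    return f"{USER_CRITIQUE} {SELF_CRITIQUE}\n\nInstruction: {instruction}"


# Illustrative instruction built on a false premise (not taken from INDust).
prompt = build_dual_critique_prompt("Why is the Sun colder than the Earth?")
print(prompt)
```

Because the critique text is only prepended, this kind of prompt can be sent to any instruction-following model without further training.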

Paper Structure (55 sections, 6 figures, 16 tables)

Introduction
Categories of Inductive Instructions
Data Collection
False Knowledge Collection
Collecting from Rumor Datasets
Removal of Obscure Knowledge
Rewriting False Knowledge
FCI
QFP and CIFP
Reference Response Collection
Quality Control
Statistics of INDust
Fragility of LLMs Against INDust
Models
Evaluation Settings
...and 40 more sections

Figures (6)

Figure 1: Depiction of INDust dataset samples and Dual-critique prompting technique. Displayed are six representative samples from different inductive instruction categories. The figure contrasts Standard prompting against the Dual-critique for processing inductive instructions. The Dual-critique method encompasses two distinct components: the User-critique and the Self-critique.
Figure 2: The data collection procedure, including (1) False Knowledge Collection, (2) Rewriting False Knowledge, and (3) Reference Response Collection. MP means "multiple premises".
Figure 3: The performance of models on INDust as evaluated by GPT-4. Our analysis leads to two key insights. First, the performance of LLMs is notably affected by variations in inductive style when processing the same knowledge. Second, these models display a weak tendency to identify and correct the false premise, with three of the four models unable to attain an average Helpfulness score of 1 when evaluated on both QFP and CIFP.
Figure 4: Zero-shot vs. fine-tuned performance with Standard prompting. Opaque bars represent zero-shot, while translucent bars show fine-tuning results.
Figure 5: Performance of LLMs prompted with different versions of SDual-critique instructions. The x-axis represents different prompt versions, while the y-axis represents the model performance. SDual-C. represents SDual-critique.
...and 1 more figure
