When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

Jiale Zhao, Ke Fang, Lu Cheng

TL;DR

AskBench addresses the problem of LLMs failing to ask for clarification on underspecified prompts, which leads to unsafe or incorrect answers. It proposes an interactive benchmark that converts static QA pairs into multi-turn dialogues with explicit rubric checkpoints and a unified judge loop, plus a rubric-guided RLVR training recipe. The approach improves answer correctness and targeted clarification across diverse domains and generalizes to unseen domains. The work is aimed at safer and more reliable deployment in high-stakes settings.

Abstract

Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
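
The judge loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Episode` and `Result` types, the `run_episode` function, the `ANSWER:` prefix convention, and the keyword-matching rules are all hypothetical stand-ins for what would be an LLM judge in AskBench. The sketch only shows the control flow: decide whether a reply is final, score it if so, and otherwise simulate a user follow-up that reveals the withheld detail tied to a checkpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    query: str                   # underspecified query shown to the assistant
    hidden_info: Dict[str, str]  # checkpoint keyword -> detail the user withholds
    reference: str               # gold answer from the original QA pair


@dataclass
class Result:
    correct: bool        # did the final answer contain the reference?
    covered: List[str]   # checkpoints the assistant asked about
    turns: int           # number of assistant turns used


def run_episode(assistant: Callable[[List[str]], str],
                ep: Episode, max_turns: int = 3) -> Result:
    history = [ep.query]
    covered: List[str] = []
    for turn in range(1, max_turns + 1):
        reply = assistant(history)
        history.append(reply)
        # Assumption: a toy rule replaces the judge's "is this final?" decision.
        if reply.startswith("ANSWER:"):
            correct = ep.reference.lower() in reply.lower()
            return Result(correct, covered, turn)
        # Otherwise treat the reply as a clarifying question and check
        # which checkpoints it touches (toy keyword match).
        hit = [k for k in ep.hidden_info
               if k in reply.lower() and k not in covered]
        covered.extend(hit)
        # Simulated user: reveal only the details that were asked for.
        if hit:
            user_turn = " ".join(ep.hidden_info[k] for k in hit)
        else:
            user_turn = "I'm not sure what you mean."
        history.append(user_turn)
    # Turn budget exhausted without a final answer.
    return Result(False, covered, max_turns)


def toy_assistant(history: List[str]) -> str:
    # Hypothetical policy: ask once, then answer.
    if len(history) == 1:
        return "Which unit should I convert to?"
    return "ANSWER: 100 km is about 62 miles."


episode = Episode(
    query="Convert 100 km.",
    hidden_info={"unit": "Please convert to miles."},
    reference="62 miles",
)
result = run_episode(toy_assistant, episode)
print(result)
```

In this toy run the assistant asks about the unit on its first turn, the simulated user supplies the withheld detail, and the episode ends on the second assistant turn with the `unit` checkpoint covered. An assistant that answered immediately would finish in one turn with no checkpoints covered, which is the behavior the benchmark is meant to expose.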

Paper Structure

This paper contains 45 sections, 2 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Overview of the data construction pipeline.
  • Figure 2: AskBench evaluation loop. The judge determines whether a reply is a final answer, scores it, or simulates a user follow-up when the assistant asks for clarification.
  • Figure 3: Training data collection. We first build a difficulty-balanced pool via rejection sampling and pass-rate bucketing, then apply the same prompt-based construction procedure to generate query variants with checkpoints and roll out judge-driven dialogues to obtain rubric-annotated training conversations.
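
As a rough illustration of the "difficulty-balanced pool" mentioned in the Figure 3 caption, the sketch below buckets questions by pass rate and draws the same number from each bucket. The caption does not give the bucket boundaries, the number of sampled attempts, or the sampling rule, so `n_buckets`, `per_bucket`, the equal-width bucketing, and the function names are assumptions made purely for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def pass_rate(attempts: List[bool]) -> float:
    # Fraction of sampled attempts that were judged correct.
    return sum(attempts) / len(attempts)


def balanced_pool(attempts_by_id: Dict[str, List[bool]],
                  n_buckets: int = 3, per_bucket: int = 1,
                  seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    buckets: Dict[int, List[str]] = defaultdict(list)
    for qid, attempts in attempts_by_id.items():
        rate = pass_rate(attempts)
        # Assumption: equal-width buckets over [0, 1]; a rate of 1.0
        # is clamped into the top bucket.
        idx = min(int(rate * n_buckets), n_buckets - 1)
        buckets[idx].append(qid)
    pool: List[str] = []
    for idx in sorted(buckets):
        items = buckets[idx]
        # Draw the same number from each difficulty bucket where possible.
        pool.extend(rng.sample(items, min(per_bucket, len(items))))
    return pool


# Hypothetical correctness outcomes for four sampled attempts per question.
attempts_by_id = {
    "q1": [True, True, True, True],      # pass rate 1.00 -> easy bucket
    "q2": [True, True, True, False],     # pass rate 0.75 -> easy bucket
    "q3": [True, True, False, False],    # pass rate 0.50 -> medium bucket
    "q4": [False, False, False, False],  # pass rate 0.00 -> hard bucket
    "q5": [True, False, False, False],   # pass rate 0.25 -> hard bucket
}
pool = balanced_pool(attempts_by_id)
print(pool)
```

With one item per bucket, the pool holds one hard, one medium, and one easy question, so no single difficulty level dominates the data passed on to the query-variant construction step.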