Ada-RS: Adaptive Rejection Sampling for Selective Thinking

Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, Prakhar Mehrotra

TL;DR

This work tackles latency- and cost-sensitive deployment of reasoning-enabled LLMs by promoting selective thinking. It introduces Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic framework that filters training samples using an adaptive, efficiency-aware reward to downweight verbose reasoning while preserving necessary thinking on harder inputs. Ada-RS plugs into both DPO (Ada-RS-DPO) and grouped policy optimization (Ada-RS-DAPO), yielding substantial improvements on an e-commerce tool-calling benchmark: up to 80% fewer output tokens and up to a 95% lower thinking rate, without sacrificing tool-call accuracy. The results indicate that training-signal construction and selective sample filtering are powerful levers for efficient reasoning in latency-constrained AI systems.

Abstract

Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference-pair optimization (e.g., DPO) and grouped policy optimization (e.g., DAPO). Using Qwen3-8B with LoRA on a synthetic tool-call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms, reducing average output tokens by up to 80% and thinking rate by up to 95% while maintaining or improving tool-call accuracy. These results highlight that training-signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.
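The score-then-filter step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the reward formula, the way the penalty adapts (here it scales with the fraction of correct samples in the group), the acceptance rule, and all names and parameters (`Completion`, `length_penalized_reward`, `rejection_sample`, `max_tokens`, `base_penalty`, `temperature`) are assumptions chosen only to make the idea concrete.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Completion:
    text: str
    correct: bool     # whether the tool call matched the reference
    num_tokens: int   # output length, including any thinking tokens


def length_penalized_reward(c, group_accuracy, max_tokens=1024, base_penalty=0.5):
    """Hypothetical reward: correctness minus a length penalty.

    Assumption of this sketch: the penalty weight grows with group accuracy,
    so long outputs are punished more on contexts that look easy.
    """
    penalty_weight = base_penalty * group_accuracy
    length_frac = min(c.num_tokens / max_tokens, 1.0)
    return (1.0 if c.correct else 0.0) - penalty_weight * length_frac


def rejection_sample(completions, temperature=0.1, rng=None):
    """Keep each completion with probability exp((r - r_best) / temperature).

    The best-scoring completion is always kept; lower-reward ones survive
    only occasionally, and less often as the temperature shrinks.
    """
    rng = rng or random.Random(0)
    group_accuracy = sum(c.correct for c in completions) / len(completions)
    rewards = [length_penalized_reward(c, group_accuracy) for c in completions]
    best = max(rewards)
    kept = []
    for c, r in zip(completions, rewards):
        accept_prob = math.exp((r - best) / temperature)
        if rng.random() < accept_prob:
            kept.append(c)
    return kept


samples = [
    Completion("short", True, 20),
    Completion("long", True, 800),
    Completion("wrong", False, 30),
]
survivors = rejection_sample(samples, temperature=1e-6)
```

With a very small temperature the filter behaves almost greedily and retains only the short correct completion; larger temperatures let more candidates through, which trades filtering strength for sample diversity.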

Paper Structure

This paper contains 33 sections, 4 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Comparison between always-on reasoning and selective reasoning in a tool-calling LLM agent. The agent performs explicit reasoning even for a simple user query, resulting in unnecessary inference cost and latency (left). The agent selectively skips reasoning and directly calls the appropriate tool (right).
  • Figure 2: Overview of the proposed Ada-RS training framework. Off-policy Ada-RS-DPO training pipeline (left). Given an input context $x$, a teacher policy model generates multiple rollouts. Ada-RS performs pair-wise rejection sampling based on adaptive thinking reward signals for selective thinking to construct high-quality preference pairs, which are then used to optimize the student policy model via DPO training. On-policy Ada-RS-DAPO training pipeline (right). The current policy model generates multiple rollouts for the input context. Ada-RS applies group-wise rejection sampling based on adaptive thinking rewards to select informative candidate subsets. The resulting candidate group is used to update the policy through on-policy DAPO training.
  • Figure 3: Tool call accuracy versus thinking rate across methods. The most favorable target region (high accuracy, low thinking rate) is highlighted. Points are colored by algorithm: DPO (green), DAPO (blue), SFT (orange), and no-fine-tuning/base model (red).
  • Figure 4: Tool call accuracy and average output tokens across methods relative to the Qwen3-8B base model with thinking enabled. Numbers above the bars show the percentage decrease in average output tokens relative to the base model. Bars are colored by algorithm: DPO (green), DAPO (blue), and SFT (orange).
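The pair-wise rejection sampling mentioned in the Figure 2 caption (the Ada-RS-DPO pipeline) can be sketched in Python as follows. This is an illustrative guess at the shape of the step, not the authors' procedure: the acceptance probability `1 - exp(-gap / temperature)`, the function name `build_preference_pairs`, and the `chosen`/`rejected` record layout are all assumptions, and the rewards are taken as given inputs.

```python
import itertools
import math
import random


def build_preference_pairs(rollouts, rewards, temperature=0.2, rng=None):
    """Turn scored rollouts for one context into DPO-style preference pairs.

    Assumed rule: a pair is accepted with probability that increases with
    the reward gap, so near-ties rarely become training pairs.
    """
    rng = rng or random.Random(0)
    pairs = []
    for i, j in itertools.combinations(range(len(rollouts)), 2):
        gap = rewards[i] - rewards[j]
        if gap == 0:
            continue  # a tie carries no preference signal
        chosen, rejected = (i, j) if gap > 0 else (j, i)
        accept_prob = 1.0 - math.exp(-abs(gap) / temperature)
        if rng.random() < accept_prob:
            pairs.append({"chosen": rollouts[chosen], "rejected": rollouts[rejected]})
    return pairs


# Hypothetical rollouts: one concise answer and two lower-reward ones.
rollouts = ["direct tool call", "long reasoning + tool call", "long reasoning + tool call (v2)"]
rewards = [1.0, 0.5, 0.5]
pairs = build_preference_pairs(rollouts, rewards, temperature=1e-9)
```

In this example the two lower-reward rollouts tie, so only pairs that prefer the concise rollout are produced. The resulting records could then be passed to any DPO trainer that accepts chosen/rejected pairs.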