ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control

Zhentao Tang, Yuqi Cui, Shixiong Kai, Wenqian Zhao, Ke Ye, Xing Li, Anxin Tian, Zehua Pei, Hui-Ling Zhen, Shoubo Hu, Xiaoguang Li, Yunhe Wang, Mingxuan Yuan

TL;DR

ReThinker addresses expert-level scientific reasoning by integrating retrieval, tool use, and multi-agent reasoning within a stage-wise Solver-Critic-Selector architecture guided by explicit confidence signals. It couples automated post-training data synthesis with adaptive trajectory recycling and a hybrid EvoFabric-based scaling strategy to balance computation and accuracy. Empirical results on Humanity's Last Exam, GAIA, and XBench demonstrate state-of-the-art performance, especially on long-horizon, multi-hop tasks that demand precise tool orchestration and error correction. The framework advances robust, scalable reasoning in scientific domains, while acknowledging limitations in context length and runtime latency and outlining directions for longer contexts and domain-specific tooling.

Abstract

Expert-level scientific reasoning remains challenging for large language models, particularly on benchmarks such as Humanity's Last Exam (HLE), where rigid tool pipelines, brittle multi-agent coordination, and inefficient test-time scaling often limit performance. We introduce ReThinker, a confidence-aware agentic framework that orchestrates retrieval, tool use, and multi-agent reasoning through a stage-wise Solver-Critic-Selector architecture. Rather than following a fixed pipeline, ReThinker dynamically allocates computation based on model confidence, enabling adaptive tool invocation, guided multi-dimensional reflection, and robust confidence-weighted selection. To support scalable training without human annotation, we further propose a reverse data synthesis pipeline and an adaptive trajectory recycling strategy that transform successful reasoning traces into high-quality supervision. Experiments on HLE, GAIA, and XBench demonstrate that ReThinker consistently outperforms state-of-the-art foundation models with tools and existing deep research systems, achieving state-of-the-art results on expert-level reasoning tasks.
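
The confidence-weighted selection mentioned above can be illustrated with a minimal sketch. This is not the paper's Selector: it assumes each candidate trajectory yields a final answer plus a scalar confidence in [0, 1], and it simply sums confidence per distinct answer. The function name `select_answer` and the sample data are hypothetical.

```python
from collections import defaultdict


def select_answer(candidates):
    """Return the answer with the largest summed confidence.

    `candidates` is a list of (answer, confidence) pairs. This is an
    illustrative assumption about the input shape, not ReThinker's format.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence
    # On a tie, max() keeps the answer that was seen first.
    return max(totals, key=totals.get)


# Two moderately confident votes for "B" outweigh one confident vote for "A".
candidates = [("A", 0.9), ("B", 0.6), ("B", 0.5), ("C", 0.2)]
print(select_answer(candidates))
```

Summing rather than taking the single most confident candidate lets agreement across independent paths count as evidence, which is one plausible reading of "robust" selection here.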

Paper Structure (43 sections, 6 equations, 5 figures, 15 tables, 4 algorithms)

This paper contains 43 sections, 6 equations, 5 figures, 15 tables, 4 algorithms.

Figures (5)

  • Figure 1: Performance comparison on the HLE benchmark. The results include Foundation Models with Tools, existing Inference Frameworks, and our proposed method ReThinker based on two LLMs. ReThinker (based on Gemini-3-Pro) significantly outperforms both standalone models and other inference frameworks.
  • Figure 2: Overall Framework of ReThinker: A Data-Driven and Uncertainty-Guided Agentic System for Expert-Level Scientific Reasoning. The framework comprises three integrated phases: (A) Post-Training Data Synthesis & Curation, where trajectory recycling and a validation agent generate and refine expert QA pairs through correctness checks, formatting, deduplication, and quality balancing; (B) Multi-Path Iterative Reasoning, where parallel Solver-Critic paths execute tool-enhanced reasoning to produce candidate trajectories from user queries; and (C) Confidence-Guided Selection, a three-stage process employing a Latin Square Permutation Test for initial judgment, iterative re-selection conditioned on historical data and perplexity scores (PPLs), and unanimous voting for the final decision. The system features dual feedback loops (the Data Recycling Flow and the Iterative Bootstrapping Flow) that continuously enhance the knowledge foundation and reasoning capabilities.
  • Figure 3: Tool Usage Statistics across Reasoning Phases in ReThinker.
  • Figure 4: Distributional Shift in Correct-Answer Trajectories from Solver to Critic.
  • Figure 5: Separation between Correct and Incorrect Answers Induced by Perplexity.
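
Two ingredients named in the captions above, Latin-square orderings and perplexity, can be sketched in a few lines. The captions do not say how the permutation test is constructed, so the cyclic construction below is an assumption, and both function names are hypothetical. The perplexity formula is the standard one: the exponential of the negative mean token log-probability.

```python
import math


def cyclic_latin_square(items):
    """Return len(items) orderings in which every item occupies every
    position exactly once (row r is the list rotated left by r).

    Assumed construction: one simple way to present candidates to a judge
    without favoring any fixed position.
    """
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


orderings = cyclic_latin_square(["a", "b", "c"])
for order in orderings:
    print(order)

# A sequence whose tokens each have probability 0.5 has perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity means the model assigned higher probability to its own output, which is the sense in which Figure 5 relates perplexity to answer correctness.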