ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control
Zhentao Tang, Yuqi Cui, Shixiong Kai, Wenqian Zhao, Ke Ye, Xing Li, Anxin Tian, Zehua Pei, Hui-Ling Zhen, Shoubo Hu, Xiaoguang Li, Yunhe Wang, Mingxuan Yuan
TL;DR
ReThinker targets expert-level scientific reasoning by integrating retrieval, tool use, and multi-agent reasoning in a stage-wise Solver-Critic-Selector architecture guided by explicit confidence signals. It couples automated post-training data synthesis with adaptive trajectory recycling and a hybrid EvoFabric-based scaling strategy to balance computation against accuracy. Results on Humanity's Last Exam, GAIA, and XBench show state-of-the-art performance, especially on long-horizon, multi-hop tasks that demand precise tool orchestration and error correction. The authors acknowledge limitations in context length and runtime latency, and outline future work on longer contexts and domain-specific tooling.
Abstract
Expert-level scientific reasoning remains challenging for large language models, particularly on benchmarks such as Humanity's Last Exam (HLE), where rigid tool pipelines, brittle multi-agent coordination, and inefficient test-time scaling often limit performance. We introduce ReThinker, a confidence-aware agentic framework that orchestrates retrieval, tool use, and multi-agent reasoning through a stage-wise Solver-Critic-Selector architecture. Rather than following a fixed pipeline, ReThinker dynamically allocates computation based on model confidence, enabling adaptive tool invocation, guided multi-dimensional reflection, and robust confidence-weighted selection. To support scalable training without human annotation, we further propose a reverse data synthesis pipeline and an adaptive trajectory recycling strategy that transform successful reasoning traces into high-quality supervision. Experiments on HLE, GAIA, and XBench show that ReThinker consistently outperforms both tool-augmented foundation models and existing deep research systems, setting a new state of the art on expert-level reasoning tasks.
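To make "confidence-weighted selection" concrete, the following is a minimal sketch of one way a Selector stage could combine candidate answers. It is an illustration under stated assumptions, not the method from the paper: the function name `select_answer`, the `(answer, confidence)` input format, the string normalization, and the sum-of-confidences scoring rule are all hypothetical.

```python
from collections import defaultdict


def select_answer(candidates):
    """Pick the answer with the largest total confidence.

    candidates: list of (answer, confidence) pairs, with confidence in [0, 1].
    Returns (best_answer, share), where share is the fraction of the total
    confidence mass assigned to the winning answer.
    NOTE: hypothetical scoring rule, not taken from the paper.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    # Accumulate confidence per normalized answer, so that agreeing
    # solvers reinforce each other.
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer.strip().lower()] += confidence

    total = sum(scores.values())
    best = max(scores, key=scores.get)
    # If every confidence is zero, report a zero share instead of dividing by zero.
    share = scores[best] / total if total > 0 else 0.0
    return best, share


# Example: two solvers agree on "42" with differing confidence.
best, share = select_answer([("42", 0.9), ("41", 0.4), ("41", 0.3), ("42", 0.2)])
```

Under this rule, a single high-confidence answer can outweigh several low-confidence ones, which is the main difference from plain majority voting. A low `share` could also serve as a signal to spend more computation on the question, in the spirit of confidence-based allocation.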
