Exploring the Necessity of Reasoning in LLM-based Agent Scenarios

Xueyang Zhou, Guiyao Tie, Guowen Zhang, Weidong Wang, Zhigang Zuo, Di Wu, Duanfeng Chu, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

TL;DR

This work examines whether explicit reasoning is essential for LLM-based agents in the era of Large Reasoning Models (LRMs). It introduces LaRMA, a three-phase framework that segments tasks, evaluates generic agent paradigms (ReAct and Reflexion), and benchmarks diverse LLMs and LRMs across multiple datasets with rigorous metrics. Key findings show LRMs excel at reasoning-intensive tasks like Plan Design and Problem Solving, while LLMs outperform in execution-focused Tool Usage; hybrid actor-reflector configurations further enhance performance, especially under Reflexion. However, LRMs incur higher computational costs and exhibit behavioral challenges such as overthinking and fact-ignoring tendencies, motivating balanced, hybrid designs for practical agent systems with improved efficiency and reliability.

Abstract

The rise of Large Reasoning Models (LRMs) signifies a paradigm shift toward advanced computational reasoning. Yet this progress disrupts conventional agent frameworks, which have been anchored by execution-oriented Large Language Models (LLMs). To explore this transformation, we propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving, assessed with three top LLMs (e.g., Claude3.5-sonnet) and five leading LRMs (e.g., DeepSeek-R1). Our findings address four research questions: (1) LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes; (2) LLMs excel in execution-driven tasks such as Tool Usage, prioritizing efficiency; (3) hybrid LLM-LRM configurations, pairing LLMs as actors with LRMs as reflectors, optimize agent performance by blending execution speed with reasoning depth; and (4) LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies. This study fosters deeper inquiry into LRMs' balance of deep thinking and overthinking, laying a critical foundation for future agent design advancements.
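
The hybrid actor-reflector configuration under Reflexion can be illustrated with a minimal Python sketch. This is not the authors' implementation: `reflexion_loop`, `Trial`, and the toy stub functions are hypothetical names introduced here, and the stubs stand in for what would be an LLM call (actor) and an LRM call (reflector) in a real system. The sketch assumes an evaluator that returns a scalar reward and a reflector that returns verbal feedback, which the actor sees on its next attempt.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases; none of these names come from the paper.
Actor = Callable[[str, List[str]], str]       # (task, reflections) -> answer
Evaluator = Callable[[str, str], float]       # (task, answer) -> scalar reward
Reflector = Callable[[str, str, float], str]  # (task, answer, reward) -> verbal feedback


@dataclass
class Trial:
    answer: str
    reward: float


def reflexion_loop(task: str, actor: Actor, evaluator: Evaluator,
                   reflector: Reflector, max_rounds: int = 5,
                   success: float = 1.0) -> List[Trial]:
    """Run up to max_rounds attempts, feeding verbal feedback back to the actor."""
    reflections: List[str] = []
    history: List[Trial] = []
    for _ in range(max_rounds):
        answer = actor(task, reflections)        # execution step (LLM in a hybrid setup)
        reward = evaluator(task, answer)         # scalar reward
        history.append(Trial(answer, reward))
        if reward >= success:
            break                                # stop once the task is solved
        # reasoning step (LRM in a hybrid setup): verbal reward for the next round
        reflections.append(reflector(task, answer, reward))
    return history


# Toy stubs so the sketch runs without any model API (assumptions, for illustration only).
def toy_actor(task: str, reflections: List[str]) -> str:
    return "4" if reflections else "5"           # corrects itself after one reflection


def toy_evaluator(task: str, answer: str) -> float:
    return 1.0 if answer == "4" else 0.0


def toy_reflector(task: str, answer: str, reward: float) -> str:
    return f"Answer {answer!r} scored {reward}; recheck the arithmetic."


history = reflexion_loop("What is 2 + 2?", toy_actor, toy_evaluator, toy_reflector)
```

Keeping the actor and reflector as separate callables is what makes a hybrid pairing possible: either role can be swapped for a different model without changing the loop.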

Paper Structure

This paper contains 48 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overall Performance in the ReAct Paradigm. a) Performance across tasks and models; b) Efficiency and cost comparisons.
  • Figure 2: Overview of the LaRMA Investigation Process. The study advances through: 1) Task Segmentation for Agent Capabilities; 2) Selection of Generic Agent Paradigms; 3) Comprehensive Evaluation with Different LLMs and LRMs. Here, $a$ and $o$ denote the action and observation, respectively; $r_{sca.}$ and $r_{ver.}$ denote the scalar reward given by the evaluator and the verbal reward given by the reflector, respectively.
  • Figure 3: Performance trends across Reflexion iterations. This figure illustrates the accuracy progression under the Reflexion paradigm across 5 rounds. Models are denoted as follows: LLaMA3.1-70B (L3.1-70B), GPT-4o, Claude3.5-sonnet (CL3.5), DeepSeek-R1 (DS-R1), Claude3.7-sonnet (CL3.7), Gemini-2.0-Flash (Gemini-2.0), QWQ-32B-Preview (QWQ-32B) and GLM-zero.
  • Figure 4: Probability Distributions of Token Usage and Execution Time for LLMs and LRMs Across Three Task Domains.
  • Figure 5: Exploration of Overthinking Rates for DeepSeek-R1 and Claude3.7-sonnet.
  • ...and 6 more figures