Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents

Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li

TL;DR

This study challenges the conventional wisdom that explicit, test-time thinking improves performance for LLM agents in real-world, user-facing tasks. By systematically evaluating seven models across three benchmarks and two thinking instantiations (TaaF and TaaP), the authors show that mandatory thinking often degrades task success in multi-turn interactions due to reduced information disclosure and more introverted agent behavior. They diagnose the mechanism with a fine-grained response taxonomy and case studies, demonstrating that shorter, less informative replies hinder clarification and progression. The authors additionally demonstrate that a simple information-disclosure prompting approach (InfoDis) can recover and improve performance across model families, highlighting transparency as a practical axis for agent optimization. Overall, the work advocates interaction-aware design and evaluation of reasoning mechanisms to ensure robust performance in realistic, user-engaged settings.

Abstract

Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, its effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding is that thinking makes agents more "introverted" by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at https://github.com/deeplearning-wisc/Thinking-Agent.

Paper Structure

This paper contains 39 sections, 2 equations, 19 figures, and 7 tables.

Figures (19)

  • Figure 1: An overview of the research framework.
  • Figure 2: Overall performance of agents with or without thinking on Retail and Airline.
  • Figure 3: Response scale of agents with or without thinking. Scale is measured as the average number of tokens per trajectory.
  • Figure 4: A summary of the task case in $\tau$-Retail. Blue highlights mark the milestone goal and action.
  • Figure 5: An example of response taxonomy. The response is divided into atomic statements. Yellow denotes information disclosure, while blue denotes user engagement request.
  • ...and 14 more figures