Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

Ruihan Yang; Fanghua Ye; Xiang We; Ruoqing Zhao; Kang Luo; Xinbo Xu; Bo Zhao; Ruotian Ma; Shanyi Wang; Zhaopeng Tu; Xiaolong Li; Deqing Yang; Linus

TL;DR

CogRouter introduces step-level cognitive depth adaptation for LLM agents. Grounded in ACT-R theory, it defines four cognitive levels and a two-stage training pipeline: CoSFT instills stable level-specific patterns, and CoPO performs step-level credit assignment via confidence-aware advantage reweighting. On ALFWorld and ScienceWorld, it achieves state-of-the-art task success with substantially lower token usage than fixed-pattern and trajectory-level RL baselines. The approach addresses cognitive rigidity in long-horizon agent tasks and allocates cognitive depth dynamically, scaling it with task complexity.

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.
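The abstract names confidence-aware advantage reweighting but does not give its formula here, so the following is only a minimal sketch of the general idea, not the paper's CoPO objective. It assumes a GRPO-style group-standardized advantage and a hypothetical weighting rule: a softmax over the mean action log-probability obtained at each cognitive level. The function names, the `temperature` parameter, and the scaling by the number of levels are all illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def confidence_weights(action_logprobs, temperature=1.0):
    """Softmax over per-level action confidence (hypothetical rule, not from the paper).

    action_logprobs maps a level name to the mean token log-probability of the
    action produced when the agent reasons at that level.
    """
    scaled = {lvl: lp / temperature for lvl, lp in action_logprobs.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {lvl: math.exp(v - peak) for lvl, v in scaled.items()}
    total = sum(exps.values())
    return {lvl: e / total for lvl, e in exps.items()}


def reweight_step_advantage(trajectory_advantage, chosen_level, action_logprobs):
    """Scale a trajectory-level advantage by the chosen level's relative confidence.

    Multiplying by the number of levels means a uniform confidence profile
    leaves the advantage unchanged (an assumption made for this sketch).
    """
    weights = confidence_weights(action_logprobs)
    return trajectory_advantage * weights[chosen_level] * len(weights)


# Illustrative numbers only: four rollouts with binary task rewards, and one
# step where the action is most confident under L3.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
step_confidence = {"L1": -2.3, "L2": -0.9, "L3": -0.4, "L4": -0.5}
print(reweight_step_advantage(advantages[0], "L3", step_confidence))
```

Under this rule, a step taken at the level whose action is most confident receives more than its uniform share of the trajectory's credit, which is one way to read the stated insight that appropriate cognitive depth should maximize the confidence of the resulting action.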

Paper Structure (46 sections, 11 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Preliminary
Partially Observable Markov Decision Process
GRPO for Agentic Tasks
Methodology
Task Formulation
Cognitive Level Design
Cognition-Aware Supervised Fine-tuning
Cognition-Aware Policy Optimization
Reward Design
Cognitive Group Construction
Confidence-Aware Advantage Reweighting
CoPO Optimization
Experiments
Experimental Setup
...and 31 more sections

Figures (9)

Figure 1: Illustration of the cognitive rigidity issue: While CoPO maintains an adaptive cognitive distribution (bottom), standard RL methods like GRPO (top) collapse to uniform deep thinking ($\mathcal{L}_4$), wasting computational resources on routine steps. $\mathcal{L}_1$–$\mathcal{L}_4$ represent increasing cognitive depth, from instinctive responses to strategic planning.
Figure 2: Overview of the CogRouter framework. We define four cognitive levels $\mathcal{L}_{1}$–$\mathcal{L}_{4}$, then introduce a two-stage training process: (1) Cognition-aware Supervised Fine-tuning (CoSFT), which guides the model to learn stable cognitive patterns across levels with balanced data; (2) Cognition-aware Policy Optimization (CoPO), which applies RL with confidence-aware reweighting to help the model adaptively choose suitable levels based on context complexity.
Figure 3: Cognitive level distribution after training. All RL methods (GRPO, GiGPO, CoPO) are initialized from CoSFT. While GRPO and GiGPO collapse to predominantly $\mathcal{L}_4$ thinking, CoPO learns adaptive allocation.
Figure 4: Training curves showing success rate across RL iterations for GRPO, GiGPO and CoPO on ALFWorld and ScienceWorld. CoPO achieves faster convergence to higher success rates.
Figure 5: Cognitive level distributions across trajectory progress (left) and task complexity (right) for CoPO and GRPO on ScienceWorld.
...and 4 more figures
