HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Jianing Wang, Linsen Guo, Zhengyu Chen, Qi Guo, Hongyu Zang, Wenjie Shi, Haoxiang Ma, Xiangyu Xi, Xiaoyu Li, Wei Wang, Xunliang Cai

Abstract

Recent agentic harnesses, built on orchestration frameworks that coordinate multiple agents with memory, skills, and tool use, have achieved remarkable success on complex reasoning tasks. However, the mechanism that actually drives their performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in an orchestration harness but also as an inner skill, internalized in the model's parameters, that drives the orchestrator to solve complex tasks. We characterize this skill as a two-stage pipeline, parallel reasoning followed by summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
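The two-stage pipeline named in the abstract (parallel reasoning, then summarization) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Generate` callable, the prompt wording, the function names, and the `toy_generate` stub are hypothetical and are not taken from the paper. Majority voting is included only as one simple selection-style baseline, since the abstract does not say how BoN selects its answer, and `pass_at_n` uses the usual reading that at least one of the N answers is correct.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Generate = Callable[[str], str]


def parallel_reasoning(generate: Generate, question: str, n: int) -> List[str]:
    """Stage 1: sample n independent reasoning trajectories for one question."""
    prompt = f"Question: {question}\nReason step by step."
    return [generate(prompt) for _ in range(n)]


def summarize(generate: Generate, question: str, trajectories: List[str]) -> str:
    """Stage 2: let the model read every trajectory and produce one answer."""
    joined = "\n\n".join(
        f"[Trajectory {i + 1}]\n{t}" for i, t in enumerate(trajectories)
    )
    prompt = (
        f"Question: {question}\n\n{joined}\n\n"
        "Review the trajectories above and give one final answer."
    )
    return generate(prompt)


def heavy_thinking(generate: Generate, question: str, n: int = 4) -> str:
    """Run both stages: n parallel trajectories, then a single summary call."""
    return summarize(generate, question, parallel_reasoning(generate, question, n))


def majority_vote(answers: List[str]) -> str:
    """Selection-style baseline: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def pass_at_n(answers: List[str], reference: str) -> bool:
    """Pass@N for one question: True if any of the N answers is correct."""
    return reference in answers


def toy_generate(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    if "final answer" in prompt:
        return "final: 42"
    return "draft reasoning, answer 42"
```

With the stub, `heavy_thinking(toy_generate, "What is 6 * 7?", n=3)` makes three stage-one calls and one summary call. Unlike a selection baseline, the summary call may write an answer that none of the individual trajectories contains.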

Paper Structure

This paper contains 32 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview of the heavy thinking framework for LLM test-time scaling.
  • Figure 2: The pass rate distribution of heavy thinking across different pass rates of parallel reasoning.
  • Figure 3: Final performance of different LLMs in sequential deliberation when the LLM in the parallel reasoning phase is fixed to R1-Distill-Qwen-7B.
  • Figure 4: The effectiveness of different numbers of iterations.
  • Figure 5: Effect of choosing different trajectory permutations. Random: randomly select $K$ trajectories; Max-Diversity: select the $K$ trajectories with the highest diversity; Max-Length: select the top $K$ trajectories by length; Max-Answer-Num: select the trajectories whose answer occurs most frequently.
  • ...and 5 more figures
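
Three of the selection strategies named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Trajectory` type and function names are hypothetical, length is measured in characters (the caption does not say whether characters or tokens are meant), and Max-Answer-Num is capped at $K$ here. Max-Diversity is omitted because the caption does not define the diversity measure.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    text: str    # full reasoning trace
    answer: str  # final answer extracted from the trace (extraction not shown)


def select_random(trajs: List[Trajectory], k: int, seed: int = 0) -> List[Trajectory]:
    """Random: sample k trajectories without replacement."""
    rng = random.Random(seed)
    return rng.sample(trajs, min(k, len(trajs)))


def select_max_length(trajs: List[Trajectory], k: int) -> List[Trajectory]:
    """Max-Length: keep the k longest trajectories (length in characters, assumed)."""
    return sorted(trajs, key=lambda t: len(t.text), reverse=True)[:k]


def select_max_answer_num(trajs: List[Trajectory], k: int) -> List[Trajectory]:
    """Max-Answer-Num: keep trajectories whose answer is the most frequent one.

    Capping the result at k is an assumption of this sketch.
    """
    if not trajs:
        return []
    top_answer, _ = Counter(t.answer for t in trajs).most_common(1)[0]
    return [t for t in trajs if t.answer == top_answer][:k]
```

Each function returns the subset of trajectories that would then be passed to the summarization stage.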