HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Ken Deng; Zizheng Zhan; Wen Xiang; Wenqiang Zhu; Weihao Li; Jingxuan Xu; Tianhao Peng; Xinping Lei; Kun Wu; Yifan Yao; Haoyang Huang; Huaixi Tang; Kepeng Lei; Zhiyi Lai; Songwei Yu; Zongxian Feng; Zuchen Gao; Weihao Xie; Chenchen Zhang; Yanan Wu; Yuanxing Zhang; Lecheng Huang; Yuqun Zhang; Jie Liu; Zhaoxiang Zhang; Haotian Zhang; Bin Chen; Jiaheng Liu

TL;DR

HiPO tackles the inefficiency of always-on chain-of-thought by enabling adaptive reasoning with two modes, Think-on and Think-off. It introduces a Hybrid Data Construction Pipeline and a Hybrid RL Reward System to train models to decide when to think and when to answer concisely. Across mathematics and coding benchmarks, HiPO achieves substantial token-length reductions while preserving or improving accuracy, outperforming existing adaptive reasoning methods. This principled approach advances efficient, reasoning-oriented LLM deployment in resource-sensitive settings.

Abstract

Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to decide selectively when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can serve as a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
