Human-in-the-Loop through Chain-of-Thought
Zefan Cai, Baobao Chang, Wenjuan Han
TL;DR
This work tackles the limitations of chain-of-thought prompting in long-horizon reasoning by introducing a manual correction system (MCS) that injects human corrections into sub-logics of generated rationales. MCS uses a four-stage pipeline (sampling, filtering by diversity entropy, targeted sub-logics correction, and re-prompted final answer) and is paired with CAMLOP, a cost-utility framework that balances human labor and LLM usage against accuracy and user satisfaction. Across twelve datasets covering arithmetic, commonsense, and symbolic reasoning, MCS consistently outperforms strong baselines, with notable gains when combined with Self-consistency, and analyses showing many errors arise from fixable sub-logics. The work also provides thorough cost-utility analyses, showing practical trade-offs and guidance for deploying human-in-the-loop reasoning in real-world settings. Overall, MCS and CAMLOP offer a rigorous, scalable approach to enhancing LLM reasoning while controlling costs, with strong empirical support for their effectiveness and applicability.
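The filtering stage of the pipeline can be illustrated with a minimal sketch. It assumes that diversity entropy is the Shannon entropy of the distribution of final answers across sampled rationales, and that a question is routed to a human only when this entropy exceeds a threshold. The function names, the threshold value, and the sample answers are hypothetical and chosen for illustration; they are not taken from the authors' implementation.

```python
import math
from collections import Counter


def diversity_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers.

    Assumption: this is an illustrative stand-in for the paper's
    diversity-entropy measure, not the authors' exact definition.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    total = len(answers)
    counts = Counter(answers)
    # p * log(1/p) is equivalent to -p * log(p) and avoids returning -0.0
    return sum((c / total) * math.log(total / c) for c in counts.values())


def needs_manual_correction(answers, threshold=0.5):
    """Flag a question for human correction when sampled answers disagree.

    The threshold of 0.5 is a hypothetical value, not one from the paper.
    """
    return diversity_entropy(answers) > threshold


# Hypothetical final answers extracted from five sampled rationales each.
samples_agree = ["18", "18", "18", "18", "18"]
samples_split = ["18", "20", "18", "26", "20"]

print(needs_manual_correction(samples_agree))  # all samples agree, entropy is 0
print(needs_manual_correction(samples_split))  # samples disagree, entropy is high
```

Under this reading, questions whose samples agree skip the human step entirely, which is how the filtering stage could keep labor cost down while focusing corrections on uncertain cases.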
Abstract
Powerful language models combined with Chain-of-Thought prompting have made automation increasingly pervasive, yet this approach still shows weaknesses in long-term or multi-step logical reasoning. For example, without human involvement, users do not always get the desired answers to complex mathematical problems. Against this background, we present the Manual Correction System (MCS), a human-in-the-loop system enhanced by Chain-of-Thought prompting, which explores how manual correction of sub-logics in rationales can improve an LLM's reasoning performance. Going one step further, a human-in-the-loop system must not only use humans to improve performance but also control the cost of doing so. We therefore propose a Cost-utility Analysis Model for Human-in-the-Loop systems (CAMLOP), based on classical economics theory, to analyze, quantify, and balance the utility and the corresponding cost. We conduct experiments with MCS and CAMLOP on twelve datasets. The results show a significant advantage in both cost and utility over strong baselines.
