Human-in-the-Loop through Chain-of-Thought

Zefan Cai; Baobao Chang; Wenjuan Han

TL;DR

This work tackles the limitations of chain-of-thought prompting in long-horizon reasoning by introducing a manual correction system (MCS) that injects human corrections into sub-logics of generated rationales. MCS uses a four-stage pipeline (sampling, filtering by diversity entropy, targeted sub-logics correction, and re-prompted final answer) and is paired with CAMLOP, a cost-utility framework that balances human labor and LLM usage against accuracy and user satisfaction. Across twelve datasets covering arithmetic, commonsense, and symbolic reasoning, MCS consistently outperforms strong baselines, with notable gains when combined with Self-consistency, and analyses showing many errors arise from fixable sub-logics. The work also provides thorough cost-utility analyses, showing practical trade-offs and guidance for deploying human-in-the-loop reasoning in real-world settings. Overall, MCS and CAMLOP offer a rigorous, scalable approach to enhancing LLM reasoning while controlling costs, with strong empirical support for their effectiveness and applicability.
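The filtering stage can be illustrated with a minimal sketch. It assumes that Diversity Entropy is the Shannon entropy of the final answers across the rationales sampled for one question, which this summary does not confirm. The names `answer_entropy`, `select_for_review`, and the `fraction` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers.

    Assumption: this stands in for Diversity Entropy; the paper's exact
    definition may differ.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def select_for_review(samples_by_question, fraction):
    """Return the question ids in the top `fraction` by entropy, highest first.

    These are the questions whose sampled answers disagree the most, so they
    are the ones a human would be asked to correct.
    """
    ranked = sorted(
        samples_by_question,
        key=lambda qid: answer_entropy(samples_by_question[qid]),
        reverse=True,
    )
    k = math.ceil(fraction * len(ranked))
    return ranked[:k]


def most_likely_answer(answers):
    """Majority answer among the samples; its rationale is the one to correct."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled final answers for three questions.
samples = {
    "q1": ["8", "8", "8", "8"],  # full agreement, entropy 0
    "q2": ["5", "7", "5", "9"],  # strong disagreement
    "q3": ["3", "3", "4", "3"],  # mild disagreement
}
flagged = select_for_review(samples, fraction=0.5)
```

Here `flagged` lists the questions with the most disagreement among samples. Under this assumption, only those are routed to manual correction, and the rest keep the LLM's own answer.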

Abstract

Although powerful language models combined with Chain-of-Thought prompting have made automation increasingly widespread, they still show weaknesses in long-term or multi-step logical reasoning. For example, users don't always get desirable answers for complex mathematical problems without human involvement. Against this background, we present the Manual Correction System (MCS), a human-in-the-loop system enhanced by Chain-of-Thought prompting, which explores how manual correction of sub-logics in rationales can improve an LLM's reasoning performance. Going one step further, a human-in-the-loop system involves not only having humans improve performance but also controlling the cost. Therefore, we propose a Cost-utility Analysis Model for Human-in-the-Loop systems (CAMLOP), based on classical economics theory, to analyze, quantify, and balance the utility and the corresponding cost. We conduct experiments with MCS and CAMLOP on twelve datasets. A significant advantage with respect to cost and utility demonstrates its superiority over strong baselines.

Paper Structure (36 sections, 9 equations, 6 figures, 17 tables)

Introduction
Manual Correction System
Filtering Stage
Correction Stage
Cost-utility Analysis Model for Human-in-the-Loop Systems
Experiments
Setup
Tasks and datasets.
Baselines.
Models and scales.
Sampling scheme.
Main Results
Arithmetic Reasoning
Commonsense and Symbolic Reasoning
Analysis of Whether Correcting Sub-logics Solves the Majority of Incorrect Rationales
...and 21 more sections

Figures (6)

Figure 1: MCS comprises four stages: (1) sampling stage prompting the LLM using CoT prompting and replacing the greedy decoding by sampling from the LLM's decoder to generate a set of rationales (i.e., the complete logical chain of CoT output); (2) filtering stage filtering out the samples ranked high by Diversity Entropy; (3) correction stage manually adding, deleting and modifying erroneous sub-logics in the most likely rationale of the filtered sample, and (4) answer stage prompting the LLM using CoT prompting again with manually corrected sub-logics and using greedy decoding to obtain the final answer.
Figure 2: Illustration of CAMLOP.
Figure 3: Error analysis of Chain-of-Thought prompting across twelve tasks. Each error type is represented by a color. The size of each colored segment indicates the share of that error type.
Figure 4: Results for different DE thresholds. It shows the results of MCS with 5%, 10%, 20%, 30%, 40% and 50% DE for AddSub (left), SingleEq (middle) and SingleOp (right). The results show that DE-based filtering is an efficient way to rank CoT outputs by how likely they are to be incorrect: samples with incorrect outputs are ranked higher than those with correct outputs.
Figure 5: ROC curves for using DE to filter out incorrect CoT outputs. It shows the ROC curve for AddSub (left), SingleEq (middle) and SingleOp (right). The results indicate that DE is a reliable metric for identifying the samples most likely to be incorrectly predicted, which are the ones that call for human involvement.
...and 1 more figures
