DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

Yiqi Li, Yusheng Liao, Zhe Chen, Yanfeng Wang, Yu Wang

TL;DR

DICE is a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction, preserving LLMs' broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands.

Abstract

When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs' broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.
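The decoupled inference flow described in the abstract (the LLM drafts in natural language, then a trained SLM analyzes the draft and emits the structured answer) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function `dice_style_inference`, the `Refined` container, the prompt layout, the `Analysis:` / `Answer:` markers, and the stub models `fake_llm` and `fake_slm` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Refined:
    analysis: str  # the SLM's analysis of the LLM draft
    answer: str    # the final output in the user-required format


def dice_style_inference(
    question: str,
    format_spec: str,
    llm_generate: Callable[[str], str],
    slm_refine: Callable[[str], str],
) -> Refined:
    """Two-step flow: free-form LLM draft, then SLM analyze-then-answer."""
    # Step 1: the LLM answers in natural language, with no format constraint.
    draft = llm_generate(question)

    # Step 2: the SLM receives the question, the format spec, and the draft.
    # The prompt layout below is an assumption made for this sketch.
    prompt = (
        f"Question: {question}\n"
        f"Required format: {format_spec}\n"
        f"Draft: {draft}\n"
    )
    raw = slm_refine(prompt)

    # Hypothetical convention: an "Analysis:" part followed by an "Answer:" part.
    analysis, sep, answer = raw.partition("Answer:")
    if not sep:
        raise ValueError("SLM output lacks an 'Answer:' section")
    return Refined(analysis.replace("Analysis:", "", 1).strip(), answer.strip())


# Stub models standing in for real LLM and SLM calls (hypothetical).
def fake_llm(question: str) -> str:
    return "Adding 2 and 3 gives 5, so the result is 5."


def fake_slm(prompt: str) -> str:
    return (
        "Analysis: The draft computes 2 + 3 = 5 correctly.\n"
        "Answer: <answer>5</answer>"
    )


result = dice_style_inference(
    "What is 2 + 3?", "<answer>...</answer>", fake_llm, fake_slm
)
print(result.answer)
```

Passing the two models as plain callables keeps the sketch independent of any particular serving API; only the SLM side would be trained, which matches the abstract's point that the LLM's parameters are left untouched.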

Paper Structure

This paper contains 35 sections, 9 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Structured format accuracy and unstructured output accuracy across model sizes on MATH. The models are required to generate structured output given 2-shot prompts. The bars represent content accuracy of unstructured natural language outputs, and the lines denote the format accuracy of structured outputs. More details about formats are in Appendix \ref{sec:format}.
  • Figure 2: Overview of DICE framework. The training process comprises two sequential phases: DICE first employs a two-stage strategy to construct structured chain-of-thought data and subsequently implements a dual-tuning methodology to optimize the SLM to enforce rigorous format compliance. During inference, the trained SLM systematically analyzes and refines the natural language outputs from the LLM.
  • Figure 3: Cross-dataset generalization ability of different methods. The 1.5B SLMs trained on GSM8K and MATH through different methods are evaluated on test sets of both benchmarks. "A$\rightarrow$B" represents models that are trained on A and tested on B.
  • Figure 4: The consistency analysis between natural language outputs from LLM and outputs in XML format from 1.5B SLMs in generative approaches. We investigate the consistency in output correctness using four evaluation metrics: Mis-correction Rate (CER), Correction Rate (ECR), Consistent Error Rate (EER), and Consistent Correct Rate (CCR). These metrics provide a comprehensive insight into the strengths of the DICE framework.
  • Figure 5: Format templates for GSM8K, MATH, and StrategyQA datasets. The symbol (\n) indicates the presence of a newline character at that specific position.
  • ...and 2 more figures
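
The four consistency metrics named in the Figure 4 caption can be read as the four cells of a 2×2 table comparing LLM correctness with SLM correctness on each example. The sketch below follows that reading. It is an assumption, not the paper's stated definition: in particular, every rate here is normalized by the total number of examples, and the paper may normalize differently (for instance, over only the examples the LLM got wrong). The function name `consistency_rates` and the dictionary keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def consistency_rates(
    llm_correct: Sequence[bool], slm_correct: Sequence[bool]
) -> Dict[str, float]:
    """Fractions of examples in each (LLM correct?, SLM correct?) cell."""
    if len(llm_correct) != len(slm_correct):
        raise ValueError("both sequences must have the same length")
    n = len(llm_correct)
    if n == 0:
        raise ValueError("need at least one example")

    # Count how often each (LLM, SLM) correctness pair occurs.
    counts = Counter(zip(llm_correct, slm_correct))

    # Assumed normalization: divide every cell by the total example count.
    return {
        "mis_correction": counts[(True, False)] / n,      # LLM right, SLM wrong
        "correction": counts[(False, True)] / n,          # LLM wrong, SLM right
        "consistent_error": counts[(False, False)] / n,   # both wrong
        "consistent_correct": counts[(True, True)] / n,   # both right
    }


rates = consistency_rates(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
print(rates)
```

Under this normalization the four rates always sum to 1, so a high correction rate paired with a low mis-correction rate indicates that the refinement step repairs more answers than it breaks.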