ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng

TL;DR

ReFIne addresses the lack of trustworthy reasoning in large reasoning models by enforcing interpretability, faithfulness, and reliability in long‑form traces. It combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to shape structured traces, explicit cross‑section references, and self‑assessments with calibrated confidence. Across three Qwen3 model sizes (1.7B/4B/8B) and math benchmarks of varying difficulty, ReFIne improves all three trustworthiness dimensions while maintaining or modestly improving accuracy and reducing reasoning length. This work provides a concrete, scalable approach to making extensive chain‑of‑thought reasoning more auditable and dependable in practical applications.

Abstract

Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
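As a rough illustration of the GRPO component mentioned in the abstract, the sketch below shows the group-relative advantage normalization that GRPO is generally built on: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is standardized against its own group. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, the example rewards, and the choice of population standard deviation are all illustrative; the actual ReFIne reward design is not described in this section.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    The population standard deviation is used here; implementations differ on
    this detail, so treat it as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean get a positive advantage and the rest get a negative one, so the policy update depends on relative quality within the group and does not need a separately learned value function.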

Paper Structure

This paper contains 41 sections, 7 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison between standard LRMs and our ReFIne framework, showing improvements in interpretability, faithfulness, and reliability while maintaining accuracy and efficiency.
  • Figure 2: Pairwise readability comparison across all datasets, judged by QwQ-32B. ReFIne is consistently judged to produce reasoning that is clearer and easier to follow.
  • Figure 3: Accuracy across benchmarks. Error bars denote standard deviation across runs.
  • Figure 4: Reasoning length (tokens; lower is better).
  • Figure 5: ReFIne (right) vs. Plain (left) on GSM8K. The long reasoning (<think>) segments are truncated due to page space limitations.
  • ...and 3 more figures