QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Zongxian Yang, Jiayu Qian, Zhi-An Huang, Kay Chen Tan

TL;DR

QM-ToT tackles the challenge of deploying high-performing LLMs for biomedical tasks under INT4 quantization by decomposing medical problems into branching reasoning paths and scoring them with a specialized evaluator. The framework combines a path-based Tree of Thought with a two-stage evaluation, and introduces Reflection-ToT to distill ToT-derived reasoning into longer, high-quality traces for training. On MedQA-USMLE, INT4-quantized variants of open-source models show substantial accuracy gains over CoT baselines, with QM-ToT delivering the largest improvements (e.g., from 34% to 50% for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b); Reflection-ToT further boosts data efficiency, improving on traditional distillation by 86.27% while using only 3.9% of the data. These results demonstrate the viability of high-performing, resource-efficient biomedical reasoning in settings with limited computational budgets and point to future enhancements such as MCTS and RLHF to further optimize ToT-based medical reasoning.

Abstract

Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to reduced performance when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework yields substantial performance improvements in INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate an accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. In addition, we propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieve an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

Paper Structure

This paper contains 18 sections, 8 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: FP16 vs INT4 quantization performance comparison. Performance gap between FP16 and INT4 quantization for LLaMA3-70b, LLaMA2-70b, Qwen2.5-72b, and LLaMA3.1-8b models on the MedQA-USMLE dataset.
  • Figure 2: Tree-based Reasoning and Dual-Evaluation Workflow. This diagram shows how the final score is calculated. Starting with a question, the workflow expands into a tree structure where each path can branch into multiple sub-paths, eventually leading to an answer. The answer and its chain of thought are then evaluated by two specialized evaluators, one for reasoning and one for medical fact correctness. These evaluations combine to produce a final quality score.
  • Figure 3: QM-ToT decision workflow. This workflow diagram illustrates the decision-making process of the QM-ToT framework, from initial question to final answer generation. The process begins with a question that feeds into a tree of thought system, generating different chains of thought. These chains are then evaluated by an Evaluator component, which assigns a Final Score to each path. The framework then compares two selections: the option with the highest average score (Average Max) and the option with the highest individual Final Score (Score Max). If these two selections match, that option is the Final Choice. If they differ, the model initiates a re-evaluation process to reach the Final Choice.
  • Figure 4: Reflection-ToT: a data distillation method driven by ToT. Short CoT generated by ToT without reflection is refined by the Qwen2.5-72b model to produce long CoT. Correct and incorrect long CoT are randomly paired to form Direct Preference Optimization (DPO) training data.
  • Figure 5: Difficulty classification of the dataset based on CoT-SC accuracy
  • ...and 4 more figures
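
The decision rule described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_answer`, the `(option, final_score)` pair representation, the 0 to 1 score scale, and the `reevaluate` callback are all hypothetical. The caption does not say how re-evaluation works, so it is left as a caller-supplied function.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

# One scored reasoning path: (answer option it ends in, Final Score from the evaluator).
ScoredPath = Tuple[str, float]


def select_answer(
    paths: List[ScoredPath],
    reevaluate: Callable[[List[ScoredPath], str, str], str],
) -> Tuple[str, bool]:
    """Return (final_choice, was_reevaluated) for a set of scored paths.

    `reevaluate` is a hypothetical hook that is only called when the two
    selections disagree; its behavior is not specified by the source.
    """
    if not paths:
        raise ValueError("at least one scored path is required")

    # Group Final Scores by the answer option each path reaches.
    scores_by_option = defaultdict(list)
    for option, score in paths:
        scores_by_option[option].append(score)

    # Average Max: the option whose paths have the highest mean Final Score.
    average_max = max(
        scores_by_option,
        key=lambda opt: sum(scores_by_option[opt]) / len(scores_by_option[opt]),
    )

    # Score Max: the option reached by the single highest-scoring path.
    score_max = max(paths, key=lambda path: path[1])[0]

    if average_max == score_max:
        return average_max, False
    return reevaluate(paths, average_max, score_max), True


if __name__ == "__main__":
    # Placeholder re-evaluation that simply trusts the averaged selection.
    def trust_average(paths, average_max, score_max):
        return average_max

    agree = [("A", 0.9), ("A", 0.7), ("B", 0.6)]
    disagree = [("A", 0.95), ("A", 0.2), ("B", 0.7), ("B", 0.8)]
    print(select_answer(agree, trust_average))
    print(select_answer(disagree, trust_average))
```

In the second example, option A holds the single best path (0.95) while option B has the better average (0.75 versus 0.575), so the two selections disagree and the re-evaluation hook decides. Ties are broken by first occurrence here, which is an arbitrary choice made for the sketch.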