Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models

Michał Romaszewski, Przemysław Sekuła, Przemysław Głomb, Michał Cholewa, Katarzyna Kołodziej

TL;DR

This work introduces a novel method to train large language models by transferring knowledge from a random forest ensemble. By converting RF decision paths into natural-language statements and sampling multiple trees, the authors generate training data that enables LLMs to both classify numerical data and produce explainable justifications, validated by a rule-based precision/recall framework. The preprocessing steps—Integer Normalisation, Verbal Description of Values, and Relation Encoding—significantly improve the correctness and parsability of LLM-generated explanations, with LoRA-based fine-tuning of a FLAN-T5-base model enabling efficient adaptation. Overall, the approach achieves high labeling accuracy across Iris, Wine, and Breast Cancer datasets and demonstrates promising, verifiable explanations suitable for explainable AI applications and ensemble-model interpretability.

Abstract

Large Language Models (LLMs) have shown exceptional performance in text processing. Notably, LLMs can synthesize information from large datasets and explain their decisions in a manner similar to human reasoning through a chain of thought (CoT). An emerging application of LLMs is handling and interpreting numerical data, where fine-tuning enhances their performance over basic inference methods. This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble, leveraging its efficiency and accuracy. By converting RF decision paths into natural language statements, we generate outputs for LLM fine-tuning, enhancing the model's ability to classify and explain its decisions. Our method includes verifying these rules through established classification metrics, ensuring their correctness. We also examine the impact of preprocessing techniques on the representation of numerical data and their influence on classification accuracy and rule correctness.
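To make the idea of converting RF decision paths into natural language statements concrete, the following is a minimal sketch and not the authors' implementation. The two hand-written trees, their thresholds, the sentence template, and the function names `decision_path` and `explain` are all assumptions chosen for illustration; the paper's actual statement format and preprocessing steps are not reproduced here.

```python
from collections import Counter

# Hypothetical hand-written trees standing in for a trained random forest.
# Internal node: (feature, threshold, left_subtree, right_subtree).
# Leaf: a class label string. Thresholds are illustrative, not from the paper.
TREE_A = ("petal length", 2.5, "setosa",
          ("petal width", 1.7, "versicolor", "virginica"))
TREE_B = ("petal width", 0.8, "setosa",
          ("petal length", 4.9, "versicolor", "virginica"))
FOREST = [TREE_A, TREE_B]


def decision_path(tree, sample):
    """Walk one tree and return (conditions, label) for the sample."""
    conditions = []
    node = tree
    while not isinstance(node, str):
        feature, threshold, left, right = node
        if sample[feature] <= threshold:
            conditions.append(f"{feature} is at most {threshold}")
            node = left
        else:
            conditions.append(f"{feature} is greater than {threshold}")
            node = right
    return conditions, node


def explain(forest, sample):
    """Return the majority label and one sentence per tree that voted for it."""
    paths = [decision_path(tree, sample) for tree in forest]
    label = Counter(lbl for _, lbl in paths).most_common(1)[0][0]
    sentences = [
        f"Because {' and '.join(conds)}, the class is {lbl}."
        for conds, lbl in paths
        if lbl == label
    ]
    return label, sentences


sample = {"petal length": 4.5, "petal width": 1.3}
label, sentences = explain(FOREST, sample)
print(label)
for sentence in sentences:
    print(sentence)
```

Each sentence pairs a set of threshold conditions with a class label, so it can be treated as a rule and checked against labelled data, which is the kind of verification through classification metrics the abstract refers to.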

Paper Structure (24 sections, 7 equations, 2 tables)