One for All: A General Framework of LLMs-based Multi-Criteria Decision Making on Human Expert Level

Hui Wang, Fafa Zhang, Chaoxu Mu

TL;DR

The study addresses the bottleneck of traditional MCDM in high-dimensional settings by proposing a general LLM-based evaluation framework that combines AHP-FCE with modern prompting techniques and LoRA-based fine-tuning. It systematically compares API and open-source LLMs across three applications (supplier evaluation, customer satisfaction, and air quality) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting. CoT and few-shot prompts yield significant performance gains, and LoRA fine-tuning reaches human-expert-level accuracy (~95%). The results show that base LLMs lag behind expert evaluations, but carefully designed prompts plus lightweight domain adaptation can close much of this gap, with performance differences between models becoming negligible after fine-tuning. This work provides a practical path toward automated, scalable MCDM support and highlights the potential for deploying LLM-driven decision support across diverse domains, while outlining future work to extend to more tasks and datasets. The weight vector $W_i=(w_1, w_2, \ldots, w_n)$ and the matrix $R$ from the AHP-FCE framework are used within the LLM-based evaluation to quantify criteria weights and fuzzy ratings, illustrating the integration of traditional MCDM with language-model reasoning.
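As a rough illustration of how a weight vector and a fuzzy rating matrix combine in an FCE-style evaluation, the sketch below computes a composite score vector $B = W \cdot R$ and picks the grade with the highest membership. It assumes the common weighted-average composition operator and the maximum-membership rule; the summary does not state which operator the paper uses. The function names, grade labels, and all numbers are hypothetical.

```python
def fce_composite(weights, rating_matrix):
    """Weighted-average composition B = W . R (an assumed operator).

    weights: one non-negative weight per criterion (e.g. from AHP).
    rating_matrix: one row per criterion, one column per grade; each
    entry is the membership degree of that criterion in that grade.
    """
    if len(weights) != len(rating_matrix):
        raise ValueError("need exactly one weight per criterion row")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]  # make the weights sum to 1
    n_grades = len(rating_matrix[0])
    return [
        sum(normalized[i] * rating_matrix[i][j] for i in range(len(normalized)))
        for j in range(n_grades)
    ]


def best_grade(scores, grades):
    """Maximum-membership rule: return the grade with the largest score."""
    return grades[max(range(len(scores)), key=scores.__getitem__)]


# Hypothetical example: three criteria rated against three grades.
W = [0.5, 0.3, 0.2]
R = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]
B = fce_composite(W, R)
label = best_grade(B, ["good", "fair", "poor"])
```

With these made-up inputs the composite vector is approximately (0.43, 0.33, 0.24), so the maximum-membership rule selects the first grade.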

Abstract

Multi-Criteria Decision Making (MCDM) is widely applied in various fields, using quantitative and qualitative analyses of multiple levels and attributes to support decision makers in making scientific and rational decisions in complex scenarios. However, traditional MCDM methods face bottlenecks in high-dimensional problems. Large Language Models (LLMs) achieve impressive performance on various complex tasks, yet limited work evaluates LLMs on specific MCDM problems with the help of human domain experts. We therefore further explore the capability of LLMs by proposing an LLM-based evaluation framework that automatically handles general complex MCDM problems. Within the framework, we assess the performance of various typical open-source models, as well as commercial models such as Claude and ChatGPT, on 3 important applications. These models achieve only around 60% accuracy relative to the evaluation ground truth. With Chain-of-Thought or few-shot prompting, the accuracy rates rise to around 70% and depend heavily on the model. To further improve performance, a LoRA-based fine-tuning technique is employed. The experimental results show that the accuracy rates for the different applications improve significantly to around 95%, and the performance difference between models becomes trivial, indicating that LoRA-based fine-tuned LLMs exhibit significant and stable advantages in addressing MCDM tasks and can provide human-expert-level solutions to a wide range of MCDM challenges.

Paper Structure

This paper contains 14 sections, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of the proposed LLMs-based MCDM Framework. (a) The API models and the LoRA fine-tuned models are evaluated separately for MCDM. (b) Comparative analysis of the two sets of evaluation results. (c) Conducting MCDM evaluation combining traditional models.
  • Figure 2: Performance comparison of different open-source models on three datasets.
  • Figure 3: LoRA Qwen2-7B
  • Figure 4: LoRA ChatGLM4-9B
  • Figure 5: LoRA Llama3-8B
  • ...and 9 more figures