RECSIP: REpeated Clustering of Scores Improving the Precision

André Schamschurko, Nenad Petrovic, Alois Christian Knoll

TL;DR

The paper tackles the reliability of Large Language Models by addressing their stochastic outputs with a novel framework, RECSIP, which prompts multiple models in parallel, scores and clusters their responses, and uses a callback loop to converge on a trustworthy answer. By combining ideas from Self-Consistency and multiagent debate, RECSIP avoids relying on a single evaluator and instead uses cross-model agreement to increase precision. Evaluated on the MMLU-Pro benchmark with GPT-4o, Claude, and Gemini, RECSIP achieves a gain of $5.8$ percentage points over the best single model and outperforms the leaderboard, highlighting its potential for safer, more reliable AI-assisted decision-making. The approach trades some additional computation for substantially higher precision, with future work focusing on weighting model strengths, enhancing similarity scoring, and integrating RECSIP into automated toolchains for industrial use.

Abstract

Recent research on Large Language Models (LLMs) has demonstrated significant advances in Natural Language Processing (NLP). Despite this progress, these models still lack reliability. This stems from the stochastic architecture of LLMs, which makes it hard for users to ascertain how reliable a model's response is. Unreliable responses may cause serious harm in high-risk environments or expensive failures in industrial contexts. We therefore introduce the framework REpeated Clustering of Scores Improving the Precision (RECSIP), which improves the precision of LLMs by asking multiple models in parallel and then scoring and clustering their responses to ensure higher reliability of the response. The evaluation of our reference implementation recsip on the MMLU-Pro benchmark with the models GPT-4o, Claude, and Gemini shows an overall increase of 5.8 percentage points compared to the best model used.
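The ask, score, cluster, and callback cycle described above can be illustrated with a minimal sketch. It is not the recsip reference implementation: the `Model` type, the function names, the `difflib`-based similarity score, the greedy clustering, the unanimity acceptance rule, and the wording of the callback prompt are all assumptions made for illustration. The models are also queried sequentially here for brevity, whereas the framework asks them in parallel.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

# Hypothetical type: a "model" is any callable mapping a prompt to a text answer.
Model = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption, not the paper's scoring)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def cluster(answers: List[str], threshold: float) -> List[List[int]]:
    """Greedy clustering: an answer joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters: List[List[int]] = []
    for i, answer in enumerate(answers):
        for members in clusters:
            if similarity(answers[members[0]], answer) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def agree(
    models: Dict[str, Model],
    prompt: str,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an answer only if all models end up in a single cluster.

    Assumption: unanimity is the acceptance rule. If the models still
    disagree after `max_rounds`, return None instead of guessing.
    """
    current = prompt
    for _ in range(max_rounds):
        answers = [model(current) for model in models.values()]
        if len(cluster(answers, threshold)) == 1:
            return answers[0]
        # Callback step: show the competing answers and ask again.
        current = (
            prompt
            + "\nOther answers given: "
            + " | ".join(answers)
            + "\nReconsider and answer again."
        )
    return None
```

Returning `None` when no agreement is reached reflects the precision-oriented goal: under these assumptions, the sketch prefers withholding an answer over reporting one that the models do not share.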

Paper Structure

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Schema of the RECSIP implementation
  • Figure 2: Distribution of the reason for wrong responses in Biology
  • Figure 3: recsip response wrongly interpreted by the benchmark as B