Confident RAG: Enhancing the Performance of LLMs for Mathematics Question Answering through Multi-Embedding and Confidence Scoring

Shiting Chen, Zijian Zhao, Jinsong Chen

TL;DR

This paper addresses the instability of Retrieval-Augmented Generation (RAG) in mathematical problem solving, which stems from its heavy dependence on the choice of a single embedding model. It introduces two approaches that combine multiple embedding models: Mixture-Embedding RAG, which fuses the retrieved documents, and Confident RAG, which generates multiple answers from different embeddings and selects the one with the highest confidence score. Mixture-Embedding RAG shows limited gains, whereas Confident RAG improves average accuracy by approximately 10% over vanilla LLMs and 5% over vanilla RAG. The study reports that confidence metrics such as Self-Certainty and Distributional Perplexity are particularly effective, and that using three embedding models strikes a practical balance between performance and cost. It also discusses the potential to extend Confident RAG into fully autonomous Agentic RAG systems for educational settings, and outlines limitations and directions for future cross-domain validation and deployment.

Abstract

Large Language Models (LLMs) hold significant promise for mathematics education, yet they often struggle with complex mathematical reasoning. While Retrieval-Augmented Generation (RAG) mitigates these issues by grounding LLMs in external knowledge, its effectiveness remains unstable, heavily dependent on the choice of a single embedding model. Moving beyond static RAG workflows, we draw on agentic workflow patterns, a paradigm that introduces structured task decomposition and collaboration to enhance system performance. We propose and examine two novel approaches that combine the benefits of multiple embedding models. While our Mixture-Embedding RAG approach (fusing retrieved documents) shows limited gains, our Confident RAG method (generating multiple answers and selecting the one with the highest confidence score) demonstrates significant improvement. Experimental results show that Confident RAG achieved average accuracy improvements of approximately 10% over vanilla LLMs and 5% over vanilla RAG. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play solution for trustworthy mathematical AI assistants. Finally, we discuss how this work lays the groundwork for deploying Agentic RAG systems in educational settings, where autonomous planning and iterative refinement can be built upon our robust retrieval foundation.
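
The selection step that the abstract describes for Confident RAG (one answer per embedding model, keep the answer with the highest confidence score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Retriever` and `Generator` callables, the `Candidate` record, and the toy confidence values are hypothetical stand-ins. In practice the confidence would come from a metric such as Self-Certainty or Distributional Perplexity computed on the LLM output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a retriever maps a question to passages found with one embedding model;
# a generator maps (question, passages) to (answer, confidence score).
Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], Tuple[str, float]]


@dataclass
class Candidate:
    embedding: str     # name of the embedding model used for retrieval
    answer: str        # answer generated from that model's passages
    confidence: float  # higher means the LLM was more confident


def confident_rag(
    question: str,
    retrievers: Dict[str, Retriever],
    generate: Generator,
) -> Candidate:
    """Run one RAG pass per embedding model and keep the most confident answer."""
    if not retrievers:
        raise ValueError("at least one retriever is required")
    candidates: List[Candidate] = []
    for name, retrieve in retrievers.items():
        passages = retrieve(question)
        answer, confidence = generate(question, passages)
        candidates.append(Candidate(name, answer, confidence))
    # Select the candidate with the highest confidence score.
    return max(candidates, key=lambda c: c.confidence)


# Toy stand-ins so the sketch runs without any model or vector store.
def toy_generate(question: str, passages: List[str]) -> Tuple[str, float]:
    # Pretend confidence grows with the number of retrieved passages.
    return f"answer from {len(passages)} passages", float(len(passages))


toy_retrievers: Dict[str, Retriever] = {
    "embed_a": lambda q: ["p1"],
    "embed_b": lambda q: ["p1", "p2", "p3"],
    "embed_c": lambda q: ["p1", "p2"],
}

best = confident_rag("What is 2 + 2?", toy_retrievers, toy_generate)
```

With these toy inputs, `best` is the candidate produced from `embed_b`, since it carries the largest confidence value. Three retrievers are used here only to mirror the summary's remark that three embedding models balance performance and cost.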

Paper Structure

This paper contains 22 sections, 6 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Workflow: This figure illustrates the workflows of vanilla RAG, as well as our proposed Mixture-Embedding RAG and Confident RAG.
  • Figure 2: CDF of Accuracy for Different Metrics: The lines have been smoothed using a Gaussian filter.