RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish

TL;DR

The paper tackles the challenge of evaluating LLMs on realistic, multi-domain coding tasks by introducing RefactorCoderQA, a Stack Overflow–based benchmark spanning Software Engineering (SE), Data Science (DS), Machine Learning (ML), and Natural Language Processing (NLP), together with a cloud-edge multi-agent framework (GuideLLM, SolverLLM, JudgeLLM). It uses RefactorCoder-MoE, a model fine-tuned with QLoRA, to produce structured guidance and code, and GPT-4o to evaluate the results automatically. In the reported experiments, RefactorCoder-MoE achieves 76.84% overall accuracy and outperforms both the open-source and closed-source baselines, which the authors attribute to explicit task decomposition and domain-focused fine-tuning. Stated limitations are potential data leakage, moderate dataset size, and latency overhead; future work targets broader domains, improved reasoning modules, and latency optimization.

Abstract

To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud and responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of LLMs across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers multiple technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges sourced from Stack Overflow. We propose RefactorCoder-MoE, a fine-tuned mixture-of-experts (MoE) code language model based on DeepSeek-Coder-7B-Instruct, adapted to the RefactorCoderQA benchmark using QLoRA for domain-specific coding question answering. Extensive experiments demonstrate that RefactorCoder-MoE achieves strong and competitive performance, significantly outperforming all evaluated open-source and commercial baselines, with an overall accuracy of 76.84%.
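The three-stage flow described in the abstract can be sketched as a simple function chain. This is a minimal illustration, not the authors' implementation: the `LLM` type alias, the `PipelineResult` container, the `run_pipeline` function, and all prompt wording are assumptions introduced here. Each model is represented as a plain callable from prompt to text, so an edge-hosted guide and a cloud-hosted solver could be swapped in behind the same interface.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class PipelineResult:
    guidance: str  # methodology produced by the guide stage
    solution: str  # code produced by the solver stage
    verdict: str   # assessment produced by the judge stage


def run_pipeline(problem: str, guide: LLM, solver: LLM, judge: LLM) -> PipelineResult:
    """Chain guide -> solver -> judge; the prompt templates are illustrative only."""
    # Stage 1: the lightweight (edge) model proposes a method.
    guidance = guide(f"Outline a step-by-step method for this problem:\n{problem}")
    # Stage 2: the stronger (cloud) model writes code, conditioned on that method.
    solution = solver(
        f"Problem:\n{problem}\n\nMethod:\n{guidance}\n\nWrite the code solution."
    )
    # Stage 3: the evaluator assesses the solution against the original problem.
    verdict = judge(
        f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
        "Assess correctness and quality."
    )
    return PipelineResult(guidance=guidance, solution=solution, verdict=verdict)


if __name__ == "__main__":
    # Stub models stand in for real endpoints so the sketch runs offline.
    result = run_pipeline(
        "Remove duplicates from a list while keeping order.",
        guide=lambda p: "1. Track seen items. 2. Append unseen items.",
        solver=lambda p: "def dedupe(xs): return list(dict.fromkeys(xs))",
        judge=lambda p: "Looks correct.",
    )
    print(result.verdict)
```

Passing the models in as callables keeps the orchestration independent of where each model is hosted, which is the property the cloud-edge split relies on.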

Paper Structure

This paper contains 30 sections, 3 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: LLM raw response to the query
  • Figure 2: Overview of the RefactorCoderQA agentic framework. The process begins with a problem statement and flows through three stages: GuideLLM (methodology generation), SolverLLM (solution synthesis), and JudgeLLM (automated evaluation).
  • Figure 3: Overview of the RefactorCoder Agentic Framework Workflow: a multi-agent framework that processes coding-related problem statements through three coordinated stages: GuideLLM for structured methodology generation, SolverLLM for executable code synthesis, and JudgeLLM for automated evaluation across accuracy, clarity, and efficiency dimensions.
  • Figure 4: Accuracy Across Domains: Closed-Source Models vs RefactorCoder-MoE
  • Figure 5: Accuracy Across Domains: Open-Source Models vs RefactorCoder-MoE