RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
TL;DR
The paper tackles the challenge of evaluating LLMs on realistic, multi-domain coding tasks by introducing two things: RefactorCoderQA, a Stack Overflow–based benchmark spanning software engineering (SE), data science (DS), machine learning (ML), and natural language processing (NLP); and a cloud-edge multi-agent framework made up of GuideLLM, SolverLLM, and JudgeLLM. The framework uses RefactorCoder-MoE, a model fine-tuned with QLoRA, to produce structured guidance and code, and GPT-4o performs the automated evaluation. In the reported experiments, RefactorCoder-MoE achieves 76.84% overall accuracy and outperforms both open-source and closed-source baselines, which the authors attribute to explicit task decomposition and domain-focused fine-tuning. Limitations include potential data leakage, moderate dataset size, and latency overhead; future work targets broader domains, improved reasoning modules, and latency optimization.
Abstract
To improve the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud and responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of LLMs across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers multiple technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges sourced from Stack Overflow. We also propose RefactorCoder-MoE, a fine-tuned mixture-of-experts (MoE) code language model based on DeepSeek-Coder-7B-Instruct, adapted to the RefactorCoderQA benchmark using QLoRA for domain-specific coding question answering. Extensive experiments demonstrate that RefactorCoder-MoE significantly outperforms all evaluated open-source and commercial baselines, with an overall accuracy of 76.84%.
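
The three-stage flow described in the abstract (guidance, then solution, then judgment) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `LLM` type, the `run_pipeline` function, the prompt wording, and the stub models are all hypothetical, and each model is represented as a plain function from a prompt string to a response string so the example runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical abstraction: any function that maps a prompt to a text response.
LLM = Callable[[str], str]


@dataclass
class PipelineResult:
    guidance: str  # methodological steps from the edge-side guide model
    solution: str  # code produced by the cloud-side solver model
    verdict: str   # assessment returned by the judge model


def run_pipeline(question: str, guide: LLM, solver: LLM, judge: LLM) -> PipelineResult:
    """Chain guide -> solver -> judge. Prompt templates are illustrative assumptions."""
    # Stage 1: the lightweight guide outlines how to approach the question.
    guidance = guide(f"Question:\n{question}\n\nOutline the steps needed to solve this.")
    # Stage 2: the solver receives both the question and the guidance.
    solution = solver(
        f"Question:\n{question}\n\nGuidance:\n{guidance}\n\nWrite the code solution."
    )
    # Stage 3: the judge assesses the proposed solution against the question.
    verdict = judge(
        f"Question:\n{question}\n\nSolution:\n{solution}\n\nAssess correctness and quality."
    )
    return PipelineResult(guidance=guidance, solution=solution, verdict=verdict)


# Stub models standing in for real LLM calls (assumptions, for demonstration only).
def stub_guide(prompt: str) -> str:
    return "1. Load the CSV. 2. Drop duplicate rows."


def stub_solver(prompt: str) -> str:
    return "import pandas as pd\ndf = pd.read_csv('data.csv').drop_duplicates()"


def stub_judge(prompt: str) -> str:
    return "correct"


if __name__ == "__main__":
    result = run_pipeline(
        "How do I drop duplicate rows from a CSV?", stub_guide, stub_solver, stub_judge
    )
    print(result.guidance)
    print(result.solution)
    print(result.verdict)
```

Passing the models in as callables keeps the orchestration independent of where each model runs, so an edge-hosted guide and a cloud-hosted solver could be substituted for the stubs without changing `run_pipeline`.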
