CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge

Zhangquan Chen, Chunjiang Liu, Haobin Duan

TL;DR

CodingTeachLLM tackles the challenge of making LLMs effective coding tutors by combining a three-phase LoRA-based fine-tuning strategy with a prior module that grounds the model in code structure and educational pedagogy. The system preprocesses data with an overlap-estimation network, segments tasks using an AST-based approach, and retrieves correlated priors from a vector database to guide incremental, step-by-step tutoring. Empirical results show state-of-the-art code generation among open-source models on HumanEval (pass@1 = 75.10%) and strong dialogue performance (MMLU 56.34, C-Eval 50.60, AGIEval 45.27) for the 13B quantized model, while ablations demonstrate the pivotal roles of the three fine-tuning phases and the prior module design. The proposed framework yields robust tutoring behavior, with potential applicability to other vertical domains beyond coding education.

Abstract

In this paper, we introduce CodingTeachLLM, a large language model (LLM) designed for coding teaching. Specifically, we aim to enhance the coding ability of the LLM and guide it toward a better teaching mode in educational contexts. To this end, we propose an end-to-end, prior-based, three-phase supervised fine-tuned model, which proves more competitive than traditional fine-tuning methods. More specifically, our model realizes the structural disassembly and incremental guided output of educational knowledge. We first make the classification of three data types more robust via a sampler and an overlap estimation neural network, and inject the preprocessed datasets into the pre-trained model in three batches for LoRA fine-tuning. Then, we design a prior module that couples a system prompt, vector databases, and abstract syntax tree (AST) task segmentation. Finally, a compression method and a regularization constraint are applied to the prior-based fine-tuned model, followed by a text filter at the output end to obtain incremental guided results. Our model represents the first research effort to truly embody the tutor role, featuring abundant educational knowledge, step-by-step incremental guided outputs, and non-disclosure of answers. Extensive experiments show that our model also achieves state-of-the-art coding ability among open-source models, reaching 75.10% on the HumanEval (pass@1) benchmark. Additionally, our model maintains strong conversational capabilities, with the 13B quantized version scoring 56.34, 50.60, and 45.27 on the MMLU, C-Eval, and AGIEval (5-shot) dialogue evaluation benchmarks, respectively.
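
The abstract names abstract syntax tree task segmentation as one part of the prior module but does not spell out the procedure. The following is only a minimal sketch of the general idea, assuming Python source as input and using the standard-library `ast` module; the function name `segment_tasks`, the returned dictionary keys, and the choice to split on top-level statements are illustrative assumptions, not the authors' method.

```python
import ast

def segment_tasks(source: str) -> list[dict]:
    """Split Python source into top-level units (hypothetical 'tasks').

    Assumption: one segment per top-level statement. The paper's actual
    segmentation rules are not given in the abstract.
    """
    tree = ast.parse(source)
    segments = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            kind = "statement"
        segments.append({
            "kind": kind,
            "name": getattr(node, "name", None),  # only defs and classes have names
            "start": node.lineno,                 # 1-based first line of the unit
            "end": node.end_lineno,               # last line (Python 3.8+)
            "code": ast.get_source_segment(source, node),
        })
    return segments

EXAMPLE = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

for seg in segment_tasks(EXAMPLE):
    print(seg["kind"], seg["name"], seg["start"], seg["end"])
```

Each segment could then be presented to a learner one unit at a time, which is one plausible way to obtain the step-by-step guided output the abstract describes.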

Paper Structure

This paper contains 20 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of our method. End-to-end: a supervised learning network coupled with peripheral algorithm modules: data segmentation and distillation via a powerful LLM, an overlap estimation neural network for data preprocessing, three-phase fine-tuning of a seq2seq language model, supervised fine-tuning under efficient regularization constraints, a prior module combining AST and a vector database, and a better-designed filter for dynamic noise.
  • Figure 2: Heat map of our data preprocessing results via the overlap network. Ten samples were randomly selected, where M represents MFT-dataset and L represents Local Knowledge.
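
The Figure 1 caption lists a prior module that combines AST and a vector database. As a rough illustration of what retrieving related priors from such a store can look like, the sketch below ranks stored snippets by cosine similarity to a query vector. Everything here is an assumption for illustration: the names `cosine` and `retrieve_priors`, the in-memory list standing in for a vector database, and the hand-written three-dimensional vectors standing in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_priors(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query.

    `store` is a list of (text, vector) pairs; a real system would use a
    vector database and learned embeddings instead of this toy list.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy knowledge snippets with made-up vectors (illustrative only).
STORE = [
    ("A for loop repeats a block once per item.", [1.0, 0.0, 0.0]),
    ("Recursion needs a base case.", [0.0, 1.0, 0.0]),
    ("A while loop repeats until its condition is false.", [0.9, 0.1, 0.0]),
]

print(retrieve_priors([1.0, 0.05, 0.0], STORE, k=2))
```

The retrieved snippets would then be added to the model's prompt as prior knowledge; how the authors rank, filter, or format them is not stated in this excerpt.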