Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions

Atsushi Shirafuji, Yutaka Watanobe

TL;DR

This work tackles the learner burden caused by thousands of duplicate solution programs in online judges by proposing a deduplication and ranking pipeline that normalizes code, removes duplicates, and ranks remaining solutions by their popularity. Using AOJ's Intro to Programming problems, it shows a substantial reduction in available programs (about 60.2% on average) and that top-10 deduplicated references cover around 29.95% of all solutions, enabling learners to grasp diverse approaches with relatively few references. Qualitative evaluation indicates the suggested references are typically readable and helpful, though normalization can reduce readability and some near-duplicates persist. The approach has practical potential to improve learning outcomes and could be integrated into real online judge systems to surface representative reference solutions while reducing learner burden.

Abstract

Referring to solution programs written by other users helps learners in programming education. However, current online judge systems simply list every solution program submitted by users, sorted by submission date and time, execution time, or user rating, without considering how useful each program is as a reference. In addition, users struggle to find a variety of solution approaches because there are too many duplicated and near-duplicated programs. To motivate learners to consult various solutions and learn better solution approaches, we propose an approach to deduplicate and rank common solution programs for each programming problem. Based on the observation that a program with many duplicates adopts a more common approach and can serve as a general reference, we remove near-duplicated solution programs and rank the unique programs by their duplicate count. Experiments on solution programs submitted to a real-world online judge system show that deduplication reduces the number of programs by 60.20%, whereas the baseline reduces it by only 29.59%, meaning that users need to refer to only 39.80% of programs on average. Furthermore, our analysis shows that the top-10 ranked programs cover 29.95% of programs on average, indicating that users can grasp the approaches used in 29.95% of solution programs by referring to only 10 programs. The proposed approach shows the potential to reduce learners' burden of referring to too many solutions and to motivate them to learn a variety of solution approaches.
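The ranking and coverage ideas described in the abstract can be illustrated with a minimal sketch. It assumes that each program has already been normalized to a string and that duplicates are detected by exact string equality after normalization; the function names `rank_by_duplicates` and `top_n_coverage` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def rank_by_duplicates(normalized_programs):
    """Group identical normalized programs and rank them by duplicate count.

    Assumption: two programs count as duplicates when their normalized
    strings are exactly equal. Returns (program, count) pairs, most
    frequent first.
    """
    return Counter(normalized_programs).most_common()


def top_n_coverage(ranked, n):
    """Fraction of all submissions represented by the top-n unique programs."""
    total = sum(count for _, count in ranked)
    covered = sum(count for _, count in ranked[:n])
    return covered / total if total else 0.0


# Toy data: six submissions that collapse into three unique programs.
programs = ["A", "B", "A", "C", "A", "B"]
ranked = rank_by_duplicates(programs)

# Reduction rate: share of submissions removed by deduplication.
reduction = 1 - len(ranked) / len(programs)

print(ranked)
print(f"reduction: {reduction:.2%}")
print(f"top-1 coverage: {top_n_coverage(ranked, 1):.2%}")
```

In this toy case three of six submissions remain, so the reduction is 50%, and the single most duplicated program covers 50% of all submissions. The figures reported in the abstract (60.20% reduction, 29.95% top-10 coverage) are averages of such per-problem values.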

Paper Structure (22 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Illustration of the proposed approach.
  • Figure 2: Illustration of normalization, which is divided into four phases: tokenization, anonymization, detokenization, and formatting.
  • Figure 3: Comparison of the baseline and our proposed approaches in terms of the ratio of unique programs to all solution programs. The smaller the better.
  • Figure 4: Average solution coverage on top-$n$ ranked programs on 44 problems. The higher the better.
  • Figure 5: Examples of top-5 programs for ITP1_3_A.
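
The four normalization phases named in the Figure 2 caption (tokenization, anonymization, detokenization, and formatting) can be sketched as follows. This is only an illustration under several assumptions: the solution programs are written in Python, identifiers are replaced with hypothetical placeholders of the form `VAR_k`, and formatting is approximated with `ast.unparse` (Python 3.9 or later). The paper's actual tools and rules may differ.

```python
import ast
import builtins
import io
import keyword
import tokenize


def normalize(source: str) -> str:
    """Sketch of a normalizer: tokenize, anonymize, detokenize, format."""
    mapping = {}   # original identifier -> placeholder (hypothetical scheme)
    tokens = []
    prev = ""
    # Phase 1: tokenization.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        # Phase 2: anonymization. Keep keywords, built-ins, and attribute
        # names (anything following a dot); rename every other identifier.
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(text)
                and not hasattr(builtins, text)
                and prev != "."):
            text = mapping.setdefault(text, f"VAR_{len(mapping)}")
        tokens.append((tok.type, text))
        prev = tok.string
    # Phase 3: detokenization back into source text.
    detokenized = tokenize.untokenize(tokens)
    # Phase 4: formatting. Re-emitting the code from its syntax tree gives a
    # canonical layout and also drops comments.
    return ast.unparse(ast.parse(detokenized))


a = "n = int(input())\nfor i in range(n):\n    print(i)\n"
b = "count=int( input() )\nfor k in range(count) :\n  print( k )  # show\n"

print(normalize(a))
print(normalize(a) == normalize(b))
```

The two example programs differ only in identifier names, spacing, indentation width, and a comment, so they normalize to the same text and would be counted as duplicates. The renamed identifiers also show why normalized programs can be less readable than the originals.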