Discovering and exploring cases of educational source code plagiarism with Dolos

Rien Maertens, Maarten Van Neyghem, Maxiem Geldhof, Charlotte Van Petegem, Niko Strijbol, Peter Dawyndt, Bart Mesuere

TL;DR

Problem: detect and prevent source code plagiarism in education at scale. Approach: Dolos 2.x provides a browser-based, installation-free web app with a language-agnostic pipeline that uses concrete syntax trees (CSTs) from tree-sitter, winnowing-based fingerprinting of $k$-grams with one fingerprint sampled per window of size $w$, and automatic threshold inference from a two-Gaussian model of global similarities. Contributions: redesigned hierarchical dashboards that zoom from collection to clusters to pairs/files, language parsers bundled as a reusable component, and deployment options via API, JavaScript library, CLI, and Docker. Significance: open-source, widely adopted in academia and industry, and applicable to other code-similarity tasks such as malware analysis and copyright enforcement.
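
The fingerprinting step mentioned above can be illustrated with a minimal Python sketch of winnowing. It is not the Dolos implementation: the function names, the CRC32 hash, the plain token lists (standing in for a serialised tree-sitter CST) and the Jaccard overlap used as a similarity score are all assumptions made for illustration. The sketch hashes every $k$-gram of tokens and keeps the minimum hash in each window of $w$ consecutive hashes.

```python
import zlib
from typing import List, Set, Tuple


def kgram_hashes(tokens: List[str], k: int) -> List[int]:
    """Hash every run of k consecutive tokens (CRC32 is an illustrative choice)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes: List[int], w: int) -> List[Tuple[int, int]]:
    """Select (hash, position) pairs: the rightmost minimum of each window of w hashes.

    Inputs with fewer than w hashes yield no fingerprints in this sketch.
    """
    selected: List[Tuple[int, int]] = []
    last_pos = -1
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum within the current window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:  # the same position is recorded only once
            selected.append((smallest, pos))
            last_pos = pos
    return selected


def fingerprints(tokens: List[str], k: int, w: int) -> Set[int]:
    """Set of winnowed hashes for one token sequence."""
    return {h for h, _ in winnow(kgram_hashes(tokens, k), w)}


def similarity(a: List[str], b: List[str], k: int, w: int) -> float:
    """Jaccard overlap of fingerprint sets (an assumed stand-in for a similarity score)."""
    fa, fb = fingerprints(a, k, w), fingerprints(b, k, w)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Two files whose token sequences are identical share every fingerprint and score 1.0 under this measure. Larger values of $k$ and $w$ make the fingerprints coarser and less sensitive to short coincidental matches.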

Abstract

Source code plagiarism is a significant issue in educational practice, and educators need user-friendly tools to cope with such academic dishonesty. This article introduces the latest version of Dolos, a state-of-the-art ecosystem of tools for detecting and preventing plagiarism in educational source code. In this new version, the primary focus has been on enhancing the user experience. Educators can now run the entire plagiarism detection pipeline from a new web app in their browser, eliminating the need for any installation or configuration. Completely redesigned analytics dashboards provide an instant assessment of whether a collection of source files contains suspected cases of plagiarism and how widespread plagiarism is within the collection. The dashboards support hierarchically structured navigation to facilitate zooming in and out of suspect cases. Clusters are an essential new component of the dashboard design, reflecting the observation that plagiarism can occur among larger groups of students. To meet various user needs, the Dolos software stack for source code plagiarism detection now includes a web interface, a JSON application programming interface (API), a command line interface (CLI), a JavaScript library and a preconfigured Docker container. Clear documentation and a free-to-use instance of the web app can be found at https://dolos.ugent.be. The source code is also available on GitHub.

Paper Structure (7 sections, 5 figures)

Figures (5)

  • Figure 1: Launchpad of the Dolos web app. Left panel: upload form for submitting a new collection of source files. Right panel: searchable table for accessing, deleting and sharing previously submitted collections.
  • Figure 2: The overview dashboard’s analytics and visualisations summarise the plagiarism detection results. This specific report suggests that plagiarism is prevalent in this publicly available collection of source files. The collection info card (top left) displays basic statistics about the collection being analysed. Colour codes for the highest and average pairwise similarities (top centre and bottom left) between files indicate the level of suspicion of plagiarism, ranging from low (green), to average (orange) and high (red). The histogram (top right) and a list (bottom left) display the global similarity with the nearest neighbour of each source file. The composition of clusters (bottom right) represents the source files as circles marked with an acronym derived from their author name, and coloured according to their label. Student subjects are used as labels for this collection of source files. The individual files (bottom left) and clusters (bottom right) are ranked by decreasing suspicion of plagiarism. The web app uses a simple heuristic to determine an appropriate initial similarity threshold for clustering. This threshold can be modified either in the histogram (top right panel) or in the global settings (activated on the far right of the top navigation bar). All dashboards also have a shared setting that anonymises analytics and visualisations (useful for in-class demonstrations) and a label-based filtering for the collection of source files.
  • Figure 3: Graph showing suspected cases of plagiarism within the same collection of source files used for Figure 2. Each node represents a source file and has a colour that corresponds to its file labels. The legend (top right) can be used to include or exclude files from the graph by label. Edges connect nodes whose pairwise similarity exceeds an adjustable threshold (bottom right), here set at 83% global similarity. Clusters of connected nodes are grouped within regions whose background colour reflects the dominant colour of the cluster nodes. Source files are excluded from the graph view if their global similarity with the nearest neighbour falls below the threshold (i.e. nodes not connected by an edge to any other node in the graph), unless the "Display singletons" option (bottom right) is enabled.
  • Figure 4: The new diff view highlights the differences in the dashboard for comparing two files. In this particular case, the two solutions are almost identical, with only minor syntactic differences such as parameter and variable names, comments and string quotes. It is possible that one of the students made these changes in an attempt to disguise plagiarism.
  • Figure 5: Diagram of the different components in the Dolos ecosystem and their relationships (Dolos version 2.x). Some components can be used in isolation, as shown by the three users interacting with the components. External dependencies and standalone documentation pages (dolos-docs) have been excluded.
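
The clustering described in the Figure 3 caption (edges between files whose pairwise similarity reaches a threshold, clusters as groups of connected nodes, singletons hidden by default) can be sketched as connected components of a thresholded graph. This is a minimal illustration, not the Dolos code: the function name, the union-find approach, the treatment of a similarity exactly equal to the threshold, and the sample data are assumptions.

```python
from typing import Dict, List, Set, Tuple


def find_clusters(
    files: List[str],
    similarities: Dict[Tuple[str, str], float],
    threshold: float,
    display_singletons: bool = False,
) -> List[Set[str]]:
    """Group files into connected components of the thresholded similarity graph."""
    parent = {f: f for f in files}

    def find(x: str) -> str:
        # Follow parent links to the component root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        # Assumption: a pair exactly at the threshold is connected.
        if sim >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for f in files:
        groups.setdefault(find(f), set()).add(f)

    # Files without any edge form singleton groups, hidden unless requested.
    kept = [g for g in groups.values() if len(g) > 1 or display_singletons]
    # Largest clusters first; ties broken by member names for a stable order.
    return sorted(kept, key=lambda g: (-len(g), sorted(g)))


# Hypothetical pairwise similarities for four submissions.
example = {("a", "b"): 0.90, ("b", "c"): 0.85, ("c", "d"): 0.40, ("a", "d"): 0.10}
print(find_clusters(["a", "b", "c", "d"], example, threshold=0.83))
```

Raising the threshold removes edges and splits or dissolves clusters, while lowering it merges them, which is why the threshold is exposed as an adjustable setting in the graph view.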