Mining the Characteristics of Jupyter Notebooks in Data Science Projects

Morakot Choetkiertikul, Apirak Hoonlor, Chaiyong Ragkhitwetsagul, Siripen Pongpaichet, Thanwadee Sunetnanta, Tasha Settewong, Vacharavich Jiravatvanich, Urisayar Kaewpichai, Raula Gaikovina Kula

TL;DR

This study probes the characteristics of Jupyter notebooks on Kaggle and GitHub to understand what drives high-quality, widely used notebooks. It proposes an exploratory, data-driven methodology that fuses Kaggle data from KGTorrent with GitHub projects, extracting features across notebook structure, code quality, textual content, and visualizations, and then applying statistical and machine-learning analyses to relate these features to contributor rank and project popularity. The anticipated contributions include identifying concrete best-practice features, enabling guidelines and potential automation to improve notebook readability, reproducibility, and maintainability, and informing future cross-platform research on data science artifacts. The work holds practical impact for practitioners, educators, and platform designers seeking to elevate notebook quality and facilitate skill development from novice to deployable project levels.

Abstract

Data science skills such as data analysis, data mining, and machine learning are in exceptionally high demand across numerous industries. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive a high number of votes from the community. A high-voted notebook is considered well documented, easy to understand, and an application of best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and of popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify the common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners who want to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
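
As a rough illustration of what mining the "overall structure" of a notebook could involve, the sketch below computes a handful of structural features from a notebook's JSON representation. It is not the authors' feature set: the function name `notebook_structure_features` and every feature name in the returned dictionary are hypothetical, and the code assumes only the standard nbformat v4 layout (a `cells` list whose entries carry `cell_type`, `source`, and, for code cells, `outputs`).

```python
import json


def _source_text(cell: dict) -> str:
    """Return a cell's source as one string (nbformat allows str or list of str)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def notebook_structure_features(nb: dict) -> dict:
    """Compute illustrative structural features (hypothetical names) for one notebook."""
    cells = nb.get("cells", [])
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    md_cells = [c for c in cells if c.get("cell_type") == "markdown"]

    # Count lines of code and words of documentation.
    code_lines = sum(len(_source_text(c).splitlines()) for c in code_cells)
    md_words = sum(len(_source_text(c).split()) for c in md_cells)

    # Approximate heading count: markdown lines that start with '#'.
    # Assumption: '#' lines inside fenced examples are rare enough to ignore.
    headings = sum(
        1
        for c in md_cells
        for line in _source_text(c).splitlines()
        if line.lstrip().startswith("#")
    )

    # Count outputs that carry an image MIME type, a crude proxy for visualizations.
    image_outputs = sum(
        1
        for c in code_cells
        for out in c.get("outputs", [])
        if any(mime.startswith("image/") for mime in out.get("data", {}))
    )

    return {
        "n_code_cells": len(code_cells),
        "n_markdown_cells": len(md_cells),
        "code_lines": code_lines,
        "markdown_words": md_words,
        "headings": headings,
        "image_outputs": image_outputs,
        "markdown_ratio": len(md_cells) / len(cells) if cells else 0.0,
    }


# A tiny in-memory notebook; a real one would be read with json.load(open(path)).
example = json.loads(json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Some intro text."]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "import math\nprint(math.pi)",
         "outputs": [{"output_type": "stream", "name": "stdout", "text": "3.14\n"}]},
        {"cell_type": "code", "metadata": {}, "execution_count": 2,
         "source": ["plot()"],
         "outputs": [{"output_type": "display_data", "metadata": {},
                      "data": {"image/png": "<base64>", "text/plain": "<Figure>"}}]},
    ],
}))

features = notebook_structure_features(example)
print(features)
```

Feature rows of this kind, one per notebook, could then be joined with vote or popularity labels and passed to any model that reports feature importance; which features and models the study actually uses is left to the paper itself.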

Paper Structure (16 sections, 1 figure, 2 tables)