$\textit{lucie}$: An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository

Kenneth Ge, Phuc Nguyen, Ramy Arnaout

Abstract

The University of California--Irvine (UCI) Machine Learning (ML) Repository (UCIMLR) is consistently cited as one of the most popular dataset repositories, hosting hundreds of high-impact datasets. However, a significant portion of these datasets, including 28.4% of the top 250, cannot be imported via the $\textit{ucimlrepo}$ package that is provided and recommended by the UCIMLR website. Instead, these datasets are hosted as .zip files containing nonstandard formats that are difficult to import without additional ad hoc processing. To address this issue, here we present $\textit{lucie}$ -- $\underline{l}oad$ $\underline{U}niversity$ $\underline{C}alifornia$ $\underline{I}rvine$ $\underline{e}xamples$ -- a utility that automatically determines the data format and imports many of these previously non-importable datasets, while preserving as much of a tabular data structure as possible. $\textit{lucie}$ was designed using the top 100 most popular datasets and benchmarked on the next 130, where it achieved a success rate of 95.4% vs. 73.1% for $\textit{ucimlrepo}$. $\textit{lucie}$ is available as a Python package on PyPI with 98% code coverage.
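
To make the idea of automatically determining a data format concrete, the following is a minimal sketch of one possible heuristic: try a few candidate delimiters and keep the one that splits every non-empty line into the same number of columns. This is an illustration only and is not taken from the lucie source code; the names `CANDIDATE_DELIMITERS`, `guess_delimiter`, and `load_table` are hypothetical, and only the Python standard library is used.

```python
import csv

# Hypothetical candidate list; a real tool would likely consider more cases.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|", " "]


def guess_delimiter(text, max_lines=20):
    """Guess a delimiter for raw text, or return None if none fits.

    A delimiter "fits" when every sampled non-empty line splits into the
    same number of columns and that number is greater than one. Among
    fitting delimiters, the one producing the most columns is returned.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    best = None  # (number of columns, delimiter)
    for delim in CANDIDATE_DELIMITERS:
        rows = list(csv.reader(lines, delimiter=delim))
        widths = {len(row) for row in rows}
        if len(widths) == 1:
            n_cols = widths.pop()
            if n_cols > 1 and (best is None or n_cols > best[0]):
                best = (n_cols, delim)
    return best[1] if best else None


def load_table(text):
    """Parse text into a list of rows, or return None if no delimiter fits."""
    delim = guess_delimiter(text)
    if delim is None:
        return None
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return list(csv.reader(lines, delimiter=delim))


if __name__ == "__main__":
    sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
    print(guess_delimiter(sample))
    print(load_table(sample))
```

Returning `None` rather than raising lets a caller fall through to other strategies when the text is not delimited in a recognizable way, which is the kind of step-by-step fallback a format-detecting loader needs.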

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: The 100 most popular UCIMLR datasets, ordered by number of views and presented using a log10 scale. Green = datasets that ucimlrepo successfully imports; red = ucimlrepo failures. The UCIMLR ID# is displayed along with the name and rank. Note the correlation between success and popularity (more red bars toward the bottom).
  • Figure 2: Flowchart of the lucie algorithm, with question boxes corresponding to the steps described in the text. The green oval indicates success, the red oval indicates failure, and the gray ovals indicate an action.
  • Figure 3: Percent success for ucimlrepo (left) and lucie (right). Green = successful imports; red = failed imports. Percentages are shown for the 100 most popular datasets, 130 of the 150 next most popular datasets (separated by a dotted line), and these 100+130=230 datasets combined; the remaining 20 were excluded for having broken/external links (dark gray) or for being too large or timing out (light gray). Integers in the bars give the number of datasets. The 100 most popular datasets, on which lucie was developed, are on the bottom.