Scikit-learn: Machine Learning in Python
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay
TL;DR
Scikit-learn provides a broad, consistent, and easy-to-use suite of machine learning algorithms in Python. It is built on NumPy and SciPy, distributed under the BSD license, and developed by its community. The paper describes a minimal, interface-driven design, the project's code quality practices, and bindings to external libraries for performance. It highlights Pipeline and GridSearchCV for workflow composition and model selection, and discusses the trade-off between high-level usability and computational efficiency. The result is a practical, extensible toolkit that integrates well with scientific Python workflows and leaves room for extension to online learning for large-scale data.
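As a rough illustration of how Pipeline and GridSearchCV fit together, the sketch below chains a PCA step with a linear SVM and searches over parameters of both steps at once. It assumes a recent scikit-learn release, where GridSearchCV is imported from sklearn.model_selection; the iris data set, the step names "pca" and "svc", and the parameter grid are choices made for this example, not taken from the paper.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small built-in data set, used here only for illustration.
X, y = load_iris(return_X_y=True)

# A pipeline behaves like a single estimator: fit() runs each step in order.
pipe = Pipeline([
    ("pca", PCA()),
    ("svc", SVC(kernel="linear")),
])

# Parameters of individual steps are addressed as "<step name>__<parameter>".
param_grid = {
    "pca__n_components": [2, 3],
    "svc__C": [0.1, 1.0, 10.0],
}

# Cross-validated search over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Because the whole pipeline is refit inside each cross-validation fold, the PCA projection is learned from the training portion of each fold only, which keeps information from the held-out portion out of the preprocessing step.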
Abstract
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org.
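The API consistency mentioned in the abstract can be sketched as follows: different classifiers are trained and evaluated by the same loop, because each exposes the same fit and score methods. This is a minimal example assuming a recent scikit-learn release; the three estimators, their settings, and the iris data set are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three unrelated algorithms behind one interface.
estimators = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf"),
}

scores = {}
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)                    # same training call
    scores[name] = estimator.score(X_test, y_test)     # same evaluation call

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Swapping one algorithm for another then amounts to changing a single constructor call, with the surrounding code left untouched.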
