1. Dataset loading utilities

The sklr.datasets module provides some toy and real-world datasets commonly used by the Machine Learning community to benchmark algorithms.

1.1. General dataset API

The main dataset interface can be used to load standard datasets and real-world datasets.

These functions return a tuple (X, Y), where X is an ndarray of shape (n_samples, n_features) and Y is an array of shape (n_samples, n_classes) containing the target rankings.
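The shape contract above can be checked with a few lines of NumPy. This is a minimal sketch rather than part of the library: `check_ranking_dataset` is a hypothetical helper, the arrays are synthetic stand-ins for the output of a loader, and the encoding of Y as rank positions is an assumption made only for illustration.

```python
import numpy as np


def check_ranking_dataset(X, Y):
    """Validate the (X, Y) shape contract (hypothetical helper, not in sklr)."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    if X.ndim != 2 or Y.ndim != 2:
        raise ValueError("X and Y must both be two-dimensional")
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of samples")
    n_samples, n_features = X.shape
    n_classes = Y.shape[1]
    return n_samples, n_features, n_classes


# Synthetic stand-in for the tuple returned by a loader.
X = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])
# Assumed encoding: Y[i, j] is the rank position of class j for sample i.
Y = np.array([[1, 2, 3], [2, 1, 3], [3, 1, 2]])

dims = check_ranking_dataset(X, Y)
print(dims)
```

The same check can be applied to the tuple returned by any of the loading functions listed in this section.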

1.2. Toy datasets

We provide standard datasets that can be loaded using the following functions:

load_authorship

load_blocks

load_bodyfat

load_breast

load_calhousing

load_cpu

load_ecoli

load_elevators

load_fried

load_glass

load_housing

load_iris

load_letter

load_libras

load_pendigits

load_satimage

load_segment

load_stock

load_vehicle

load_vowel

load_wine

load_wisconsin

load_yeast

These datasets were obtained by transforming multiclass and continuous (regression) problems of the UCI Machine Learning Repository to the label ranking problem and the partial label ranking problem.

References:

  • W. Cheng, J. Hühn and E. Hüllermeier, “Decision tree and instance-based learning for label ranking”, In Proceedings of the 26th International Conference on Machine Learning, pp. 161-168, 2009.

  • J. C. Alfaro, J. A. Aledo and J. A. Gámez, “Learning decision trees for the Partial Label Ranking problem”, International Journal of Intelligent Systems, 2020, Submitted.
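To give an intuition of how a regression problem can be turned into a label ranking problem, the sketch below ranks a set of numerical columns that play the role of labels. It assumes a standardize-then-order procedure and is not claimed to be the exact transformation used to build the datasets shipped with the library; `columns_to_rankings` is a hypothetical helper and the data are synthetic.

```python
import numpy as np


def columns_to_rankings(values):
    """Rank the standardized values of each row; rank 1 is the largest value."""
    values = np.asarray(values, dtype=float)
    # Standardize each column so that labels on different scales are comparable.
    z = (values - values.mean(axis=0)) / values.std(axis=0)
    # A double argsort converts scores into rank positions
    # (ties, if any, are broken by column order).
    order = np.argsort(-z, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable") + 1
    return ranks


# Three samples, three numerical columns treated as labels (synthetic data).
labels = np.array([
    [1.0, 200.0, 0.30],
    [2.0, 100.0, 0.25],
    [6.0, 300.0, 0.20],
])
Y = columns_to_rankings(labels)
print(Y)
```

Each row of the result is a permutation of 1..n_classes, which matches the (n_samples, n_classes) target layout described in the general dataset API.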

1.2.1. Anacaldata authorship dataset

This dataset belongs to a collection of datasets used to analyze categorical data.

References:

  • J. S. Simonoff, “Analyzing Categorical Data”, Springer-Verlag, 2003.

1.2.2. Page blocks dataset

This dataset describes the blocks of the page layout of a document, as detected by a segmentation process.

References:

  • D. Malerba, F. Esposito and G. Semeraro, “A Further Comparison of Simplification Methods for Decision-Tree Induction”, In Learning from Data: Artificial Intelligence and Statistics, pp. 365-374, 1995.

  • F. Esposito, D. Malerba and G. Semeraro, “Multistrategy Learning for Document Recognition”, Applied Artificial Intelligence, vol. 8, pp. 33-84, 1994.

1.2.3. Body fat dataset

This dataset contains estimates of the percentage of body fat determined by underwater weighing and body circumference measurements for men.

References:

  • A. R. Behnke and J. H. Wilmore, “Evaluation and Regulation of Body Build and Composition”, Prentice-Hall, 1974.

  • W. E. Siri, “The Gross Composition of the Body”, Advances in Biological and Medical Physics, vol. 4, pp. 239-280, 1956.

1.2.4. Breast tissue dataset

This dataset contains measurements of freshly excised breast tissue. Plotted in the plane, these measurements constitute the impedance spectrum from which the breast tissue features are computed.

References:

  • J. Jossinet, “Variability of impedivity in normal and pathological breast tissue”, Medical & Biological Engineering & Computing, vol. 34, pp. 346-350, 1996.

  • J. E. Silva, J. P. de Sá, J. Jossinet, “Classification of Breast Tissue by Electrical Impedance Spectroscopy”, Medical & Biological Engineering & Computing, vol. 38, pp. 26-30, 2000.

1.2.5. California housing dataset

This dataset was derived from the 1990 U.S. census, using one row per census block group (a block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data).

References:

  • R. K. Pace and R. Barry, “Sparse Spatial Autoregressions”, Statistics and Probability Letters, vol. 33, pp. 291-297, 1997.

1.2.6. CPU small dataset

This dataset measures computer systems activity by means of (restricted) attributes, and the objective is to predict when the CPU is free in a certain portion of time.

References:

  • O. Okun, G. Valentini and M. Re, “Ensembles in Machine Learning Applications”, Springer-Verlag, 2011.

1.2.7. Ecoli dataset

This dataset contains protein localization sites.

References:

  • P. Horton and K. Nakai, “A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins”, In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pp. 109-115, 1996.

1.2.8. Elevators dataset

This dataset is obtained from the task of controlling an F16 aircraft, and the objective is related to an action taken on the elevators of the aircraft according to its status attributes.

References:

  • R. Camacho, “Inducing models of human control skills using Machine Learning algorithms”, 2000.

1.2.9. Friedman dataset

This is an artificial dataset consisting of independent attributes which are uniformly distributed. To obtain the value of the target variable, the following equation is used:

\[Y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \sigma(0, 1)\]
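The equation can be evaluated directly with NumPy. The following is a minimal sketch under stated assumptions: the attributes are drawn uniformly from [0, 1], ten attributes are generated of which only the first five enter the equation, and the term σ(0, 1) is read as standard normal noise. The function names are hypothetical and not part of the library.

```python
import numpy as np


def friedman_signal(X):
    """Noise-free part of the target, using only the first five attributes."""
    X = np.asarray(X, dtype=float)
    return (
        10 * np.sin(np.pi * X[:, 0] * X[:, 1])
        + 20 * (X[:, 2] - 0.5) ** 2
        + 10 * X[:, 3]
        + 5 * X[:, 4]
    )


def make_friedman(n_samples=100, n_features=10, seed=0):
    """Draw uniform attributes and add standard normal noise (assumed)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    y = friedman_signal(X) + rng.standard_normal(n_samples)
    return X, y


X, y = make_friedman(n_samples=5)
print(X.shape, y.shape)
```

Keeping the noise-free signal in its own function makes it easy to check individual terms, for instance that an all-zero input yields 20 × 0.25 = 5.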

References:

  • J. Friedman, “Multivariate Adaptive Regression Splines”, The Annals of Statistics, vol. 19, pp. 1-67, 1991.

  • L. Breiman, “Bagging predictors”, Machine Learning, vol. 24, pp. 123-140, 1996.

1.2.10. Glass identification dataset

This dataset was motivated by criminological investigation to study the classification of types of glass according to their chemical properties.

References:

  • I. W. Evett and E. J. Spiehler, “Rule Induction in Forensic Science”, 1987.

1.2.11. Boston housing dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

References:

  • D. Harrison and D. L. Rubinfeld, “Hedonic prices and the demand for clean air”, Journal of Environmental Economics and Management, vol. 5, pp. 81-102, 1978.

1.2.12. Iris dataset

This is perhaps the best-known dataset in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day.

References:

  • R. A. Fisher, “The use of multiple measurements in taxonomic problems”, Annals of Eugenics, vol. 7, pp. 179-188, 1936.

  • R. O. Duda and P. E. Hart, “Pattern Classification and Scene Analysis”, John Wiley & Sons, 1973.

  • B. V. Dasarathy, “Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 2, pp. 67-71, 1980.

  • G. W. Gates, “The Reduced Nearest Neighbor Rule”, IEEE Transactions on Information Theory, vol. 18, pp. 431-433, 1972.

1.2.13. Letter recognition dataset

This dataset contains a large number of black-and-white rectangular pixel displays, each to be identified as one of the capital letters in the English alphabet.

References:

  • P. W. Frey and D. J. Slate, “Letter Recognition Using Holland-style Adaptive Classifiers”, Machine Learning, vol. 6, pp. 161-182, 1991.

1.2.14. Libras movement dataset

The task in this dataset is to classify references to a hand movement type according to a mapping operation representing the coordinates of the movement.

References:

  • D. B. Dias, R. C. B. Madeo, T. Rocha, H. H. Bíscaro and S. M. Peres, “Hand Movement Recognition for Brazilian Sign Language: A Study Using Distance-Based Neural Networks”, In Proceedings of the International Joint Conference on Neural Networks, pp. 697-704, 2009.

1.2.15. Pen-based recognition of handwritten digits dataset

This dataset contains samples arising from handwritten digits characterized by pen trajectories (successive pen points on a coordinate system).

References:

  • F. Alimoglu, “Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition”, 1996.

  • F. Alimoglu and E. Alpaydin, “Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwriting Recognition”, In Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium, 1996.

1.2.16. Landsat satellite dataset

This dataset consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood.
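The neighbourhood construction can be illustrated with NumPy. This is a sketch, not the procedure used to build the dataset: the image is synthetic, the number of spectral bands (four) and the ordering of the values inside each feature vector are assumptions, and both function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def neighbourhood_features(image):
    """Flatten every 3x3 neighbourhood of an (H, W, bands) image into one row."""
    image = np.asarray(image)
    # Shape (H - 2, W - 2, bands, 3, 3): one window per interior pixel.
    windows = sliding_window_view(image, (3, 3), axis=(0, 1))
    # Move the band axis last so the spectral values of a pixel are contiguous.
    windows = np.moveaxis(windows, 2, -1)
    return windows.reshape(-1, 9 * image.shape[2])


def central_labels(label_map):
    """Class of the central pixel of every 3x3 neighbourhood."""
    return np.asarray(label_map)[1:-1, 1:-1].ravel()


# Synthetic 4 x 5 image with 4 spectral bands and a per-pixel class map.
image = np.arange(4 * 5 * 4).reshape(4, 5, 4)
label_map = np.arange(4 * 5).reshape(4, 5)

features = neighbourhood_features(image)
targets = central_labels(label_map)
print(features.shape, targets.shape)
```

With four bands, each sample has 9 × 4 = 36 values, and positions 16 to 19 of a row hold the spectral values of the central pixel whose class is the target.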

1.2.17. Image segmentation dataset

This dataset contains image data described by high-level attributes of outdoor images (hand-segmented to create a classification for every pixel).

1.2.18. Stock prices dataset

This dataset provides daily stock prices from January 1988 through October 1991 for 10 aerospace companies. The objective is to approximate the price of the 10th company given the prices of the rest.

References:

  • H. Altay and I. Uysal, “Bilkent University Function Approximation Repository”, 2000.

1.2.19. Vehicle silhouettes dataset

The purpose of this dataset is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette (the vehicle may be viewed from one of many different angles).

References:

  • J. P. Siebert, “Vehicle Recognition Using Rule Based Methods”, 1987.

1.2.20. Vowel recognition dataset

This dataset consists of a three-dimensional array: speaker, vowel and input. The speakers and vowels are indexed by integers and, for each utterance, there are floating-point input values.

1.2.21. Wine dataset

This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of constituents found in each of the types of wines.

References:

  • S. Aeberhard, D. Coomans and O. de Vel, “Comparative analysis of statistical pattern recognition methods in high dimensional settings”, Pattern Recognition, vol. 27, pp. 1065-1077, 1994.

  • S. Aeberhard, D. Coomans and O. de Vel, “Improvements to the classification performance of RDA”, Journal of Chemometrics, vol. 7, pp. 99-115, 1993.

1.2.22. Breast cancer Wisconsin dataset

This dataset contains features computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image.

References:

  • W. N. Street, O. L. Mangasarian and W. H. Wolberg, “An inductive learning approach to prognostic prediction”, In Proceedings of the Twelfth International Conference on Machine Learning, pp. 522-530, 1995.

  • O. L. Mangasarian, W. N. Street and W. H. Wolberg, “Breast cancer diagnosis and prognosis via linear programming”, Operations Research, vol. 43, pp. 570-577, 1995.

  • W. H. Wolberg, W. N. Street, D. M. Heisey and O. L. Mangasarian, “Computerized breast cancer diagnosis and prognosis from fine needle aspirates”, Archives of Surgery, vol. 130, pp. 511-516, 1995.

  • W. H. Wolberg, W. N. Street and O. L. Mangasarian, “Image analysis and Machine Learning applied to breast cancer diagnosis and prognosis”, Analytical and Quantitative Cytology and Histology, vol. 17, pp. 77-87, 1995.

  • W. H. Wolberg, W. N. Street, D. M. Heisey and O. L. Mangasarian, “Computer-derived nuclear grade and breast cancer prognosis”, Analytical and Quantitative Cytology and Histology, vol. 17, pp. 257-264, 1995.

1.2.23. Yeast dataset

The task in this dataset is to predict the cellular localization sites of proteins.

References:

  • P. Horton and K. Nakai, “A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins”, In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pp. 109-115, 1996.

1.3. Real-world datasets

We provide the following functions to load real-world datasets:

load_cold

load_diau

load_dtt

load_heat

load_spo

These datasets originate from the field of bioinformatics and consider two types of genetic data, namely phylogenetic profiles and microarray expression data for the Yeast genome. The Yeast genome consists of genes, and each gene is represented by an associated phylogenetic profile. Using these profiles as input features, the expression profile of a gene is ordered into ranks. The use of five microarray experiments (spo, heat, dtt, cold, diau) gives rise to five prediction problems, all using the same input features but different target rankings.

References:

  • E. Hüllermeier, J. Fürnkranz, W. Cheng and K. Brinker, “Label ranking by learning pairwise preferences”, Artificial Intelligence, vol. 172, pp. 1897-1916, 2008.