B Datasets

This section lists all the datasets used in this book. For each dataset, it states whether the dataset is included in scikit-learn. Datasets that are not part of scikit-learn are provided along with the source code in the data\ directory.

B.1 CALIFORNIA-HOUSING

Included in scikit-learn: Yes

The CALIFORNIA-HOUSING dataset (Pace and Barry 1997) has \(20640\) instances and \(8\) features that describe different characteristics of houses, such as the average number of rooms, house age, and average occupancy. The response variable is the median house value.

B.2 DIAGNOSTIC

Included in scikit-learn: Yes

The DIAGNOSTIC dataset (Wolberg et al. 1993) has \(569\) instances and \(30\) features extracted from images of a fine needle aspirate of a breast mass. The class indicates if the breast tumor is benign or malignant.
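As a minimal sketch of how this dataset can be loaded, the snippet below assumes that DIAGNOSTIC corresponds to scikit-learn's `load_breast_cancer` loader (the Breast Cancer Wisconsin (Diagnostic) data). The variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: DIAGNOSTIC is the dataset returned by load_breast_cancer,
# which ships with scikit-learn (no download needed).
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)            # (number of instances, number of features)
print(data.target_names)  # names of the two tumor classes
```

Printing the shape is a quick way to confirm that the loaded data matches the instance and feature counts stated above.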

B.3 DIGITS

Included in scikit-learn: Yes

The DIGITS dataset contains \(1797\) handwritten digits (\(0\dots9\)) as \(8\times8\) images. This is a subset of the original data (Alpaydin and Kaynak 1998).
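The following sketch loads the digits with scikit-learn's `load_digits` and shows the two layouts that are commonly needed: the \(8\times8\) image form and a flattened form with one row per digit. The variable names are illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each digit is stored as an 8x8 array of pixel intensities.
images = digits.images

# Many estimators expect a 2-D matrix, so flatten each image to 64 values.
flat = images.reshape(len(images), -1)

print(images.shape, flat.shape)
```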

B.4 IRIS

Included in scikit-learn: Yes

The IRIS dataset (Fisher 1936) includes information about 150 plants divided into three categories (‘setosa’, ‘virginica’, and ‘versicolor’).
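A short sketch for inspecting how the plants are distributed across the three categories, using scikit-learn's `load_iris`. The names `counts` and `per_class` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many plants fall into each category (targets are coded 0, 1, 2).
counts = np.bincount(iris.target)

# Map the category names to their counts for easier reading.
per_class = {str(name): int(n) for name, n in zip(iris.target_names, counts)}
print(per_class)
```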

B.5 WINE

Included in scikit-learn: Yes

The WINE dataset (Aeberhard and Forina 1992) has \(178\) instances of wine chemical analyses. Each one is categorized into one of three classes (three different cultivators in Italy).
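The sketch below loads the data with scikit-learn's `load_wine`, using `return_X_y=True` to obtain the feature matrix and class labels directly. It is only one possible way to load the dataset and not necessarily the one used elsewhere in the book.

```python
import numpy as np
from sklearn.datasets import load_wine

# return_X_y=True returns (features, labels) instead of a Bunch object.
X, y = load_wine(return_X_y=True)

print(X.shape)       # (number of instances, number of chemical measurements)
print(np.unique(y))  # integer codes of the three classes
```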

B.6 WISDM

Included in scikit-learn: No

The WISDM dataset was made available by Kwapisz, Weiss, and Moore (2010). It includes \(6\) different activities: ‘walking’, ‘jogging’, ‘walking upstairs’, ‘walking downstairs’, ‘sitting’, and ‘standing’. The data were collected from \(36\) volunteers using the accelerometer of an Android phone carried in the user’s pants pocket, with a sampling rate of \(20\) Hz.
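Because WISDM is not bundled with scikit-learn, it has to be parsed from a file. The sketch below assumes that each record is a text line of the form `user,activity,timestamp,x,y,z;`. This layout, the `Reading` type, and the helper names are assumptions made for illustration; the file in the data\ directory may be organized differently. The sample line uses made-up values.

```python
from typing import NamedTuple

SAMPLING_RATE_HZ = 20  # sampling rate stated in the dataset description


class Reading(NamedTuple):
    user: int
    activity: str
    timestamp: int
    x: float
    y: float
    z: float


def parse_line(line: str) -> Reading:
    """Parse one record of the ASSUMED form 'user,activity,timestamp,x,y,z;'."""
    user, activity, timestamp, x, y, z = line.strip().rstrip(";").split(",")
    return Reading(int(user), activity, int(timestamp),
                   float(x), float(y), float(z))


def samples_in_window(seconds: float, rate_hz: int = SAMPLING_RATE_HZ) -> int:
    """Number of accelerometer readings in a window of the given length."""
    return int(seconds * rate_hz)


# Illustrative record with made-up values, not taken from the real data.
reading = parse_line("1,Walking,1000,0.5,9.8,-0.2;")
print(reading.activity, samples_in_window(10))
```

At \(20\) Hz, a window of \(10\) seconds holds \(200\) readings, which is the kind of quantity needed when segmenting the signal into fixed-length windows.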

References

Aeberhard, Stefan, and M. Forina. 1992. “Wine.” UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J.
Alpaydin, E., and C. Kaynak. 1998. “Optical Recognition of Handwritten Digits.” UCI Machine Learning Repository. https://doi.org/10.24432/C50P49.
Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.
Kwapisz, Jennifer R., Gary M. Weiss, and Samuel A. Moore. 2010. “Activity Recognition Using Cell Phone Accelerometers.” In Proceedings of the Fourth International Workshop on Knowledge Discovery from Sensor Data (at KDD-10), Washington DC.
Pace, R Kelley, and Ronald Barry. 1997. “Sparse Spatial Autoregressions.” Statistics & Probability Letters 33 (3): 291–97.
Wolberg, William, Olvi Mangasarian, Nick Street, and W. Street. 1993. “Breast Cancer Wisconsin (Diagnostic).” UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.