B Datasets
This section lists all the datasets used in this book. Each entry states whether the dataset is included in scikit-learn. Datasets that are not part of scikit-learn are provided with the source code in the data\ directory.
B.1 CALIFORNIA-HOUSING
Included in scikit-learn: Yes
The CALIFORNIA-HOUSING dataset (Pace and Barry 1997) has \(20640\) instances and \(8\) features that describe different characteristics of houses, such as the average number of rooms, house age, and average occupancy. The response variable is the median house value.
B.2 DIAGNOSTIC
Included in scikit-learn: Yes
The DIAGNOSTIC dataset (Wolberg et al. 1993) has \(569\) instances and \(30\) features extracted from images of a fine needle aspirate of a breast mass. The class indicates if the breast tumor is benign or malignant.
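As a minimal sketch, the counts above can be checked by loading the copy of this dataset bundled with scikit-learn through `load_breast_cancer`. The variable names are illustrative only. One detail worth confirming when interpreting results is which integer code corresponds to which class.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the DIAGNOSTIC data (no download needed).
diagnostic = load_breast_cancer()
X, y = diagnostic.data, diagnostic.target

print(X.shape)                  # (instances, features)
print(diagnostic.target_names)  # class labels, indexed by the integer codes in y

# Map the integer codes back to readable labels for the first few instances.
first_labels = [str(diagnostic.target_names[code]) for code in y[:5]]
print(first_labels)
```

Because `target_names` is indexed by the integer codes in `y`, printing it shows directly which code stands for benign and which for malignant.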
B.3 DIGITS
Included in scikit-learn: Yes
The DIGITS dataset contains \(1797\) handwritten digits (\(0\dots9\)) as \(8\times8\) images. This is a subset of the original data (Alpaydin and Kaynak 1998).
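The following sketch loads the bundled data with `load_digits` and shows the two views scikit-learn exposes: the \(8\times8\) images and the same pixels flattened into one row of 64 features per digit. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

images = digits.images  # one 8x8 array per digit
flat = digits.data      # the same pixels flattened to 64 features per digit

print(images.shape, flat.shape)

# Flattening the images reproduces the feature matrix exactly.
same_pixels = np.array_equal(images.reshape(len(images), -1), flat)
print(same_pixels)

# The labels are the digits 0 through 9.
print(np.unique(digits.target))
```

The flattened view is the one most estimators expect as input, while the image view is convenient for plotting.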
B.4 IRIS
Included in scikit-learn: Yes
The IRIS dataset (Fisher 1936) includes information about 150 plants divided into three categories (‘setosa’, ‘virginica’, and ‘versicolor’).
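A short sketch, assuming the bundled copy returned by `load_iris`, that counts how many plants belong to each category. The dictionary name `per_class` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many plants fall into each category.
counts = np.bincount(iris.target)
per_class = {str(name): int(n) for name, n in zip(iris.target_names, counts)}
print(per_class)
```

The order of the keys follows scikit-learn's integer class codes, which may differ from the order in which the categories are listed in the text.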
B.5 WINE
Included in scikit-learn: Yes
The WINE dataset (Aeberhard and Forina 1992) has \(178\) instances of wine chemical analyses. Each one is categorized into one of three classes (three different cultivators in Italy).
B.6 WISDM
Included in scikit-learn: No
The WISDM dataset (Kwapisz, Weiss, and Moore 2010) includes \(6\) different activities: ‘walking’, ‘jogging’, ‘walking upstairs’, ‘walking downstairs’, ‘sitting’, and ‘standing’. The data were collected from \(36\) volunteers using the accelerometer of an Android phone placed in each user’s pants pocket, at a sampling rate of \(20\) Hz.