Mistake 11: Not shuffling the training data

Not shuffling the data can degrade the performance of some models.


In machine learning, some models are sensitive to the order of the training data, especially those based on gradient optimization and/or batch learning, such as neural networks and their many variants (Convolutional Neural Networks, Autoencoders, Generative Adversarial Networks, and so on). Shuffling the rows helps avoid bias from data ordering, improves generalization, and stabilizes parameter learning. It also makes mini-batches more representative of the overall data distribution.
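The effect on mini-batches can be illustrated with a small sketch (the labels here are synthetic, made up for this example): when the rows are sorted by class, every batch contains a single class, while a shuffled version yields batches that reflect the overall label distribution.

```python
import numpy as np

rng = np.random.default_rng(1234)

# Toy labels sorted by class: the first 50 rows are class 0,
# the next 50 are class 1 (as if the data were sorted by label).
y = np.array([0] * 50 + [1] * 50)

# Without shuffling, every mini-batch of 10 rows contains a single class.
sorted_batches = y.reshape(-1, 10)
classes_per_batch = [len(np.unique(b)) for b in sorted_batches]
print(classes_per_batch)  # each batch sees only one class

# After shuffling, batches mix both classes and better reflect
# the overall 50/50 label distribution.
shuffled_batches = rng.permutation(y).reshape(-1, 10)
mixed_per_batch = [len(np.unique(b)) for b in shuffled_batches]
print(mixed_per_batch)
```

A model trained batch by batch on the sorted version would see long runs of a single class, which is exactly the kind of ordering bias that shuffling removes.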

Not all machine learning models are sensitive to data ordering. For example, KNN and tree-based models such as decision trees, random forests, and XGBoost are not affected by the order of the rows.

Bear in mind that shuffling is not always recommended and should be avoided when fitting time-dependent models such as ARIMA and Recurrent Neural Networks. In sequence-based data, temporal order is crucial and shuffling destroys the time dependencies.
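For time-ordered data, the usual alternative is a chronological split: train on the earliest portion and test on the most recent one. A minimal sketch with a hypothetical daily series (the column names and 80/20 ratio are illustrative):

```python
import pandas as pd

# Hypothetical time-indexed series; in practice this would be real
# sensor or sales data ordered by timestamp.
ts = pd.DataFrame({
    "t": pd.date_range("2024-01-01", periods=100, freq="D"),
    "value": range(100),
})

# Keep the temporal order: train on the first 80%, test on the last 20%.
split = int(len(ts) * 0.8)
train, test = ts.iloc[:split], ts.iloc[split:]

# Every test timestamp is later than every train timestamp.
assert train["t"].max() < test["t"].min()
```

This way the model is always evaluated on data from the future relative to its training window, which shuffling would break.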

The following code loads the WISDM dataset. Figure 11.1 shows the first ten rows.

import pandas as pd

# Load the transformed WISDM activity-recognition dataset.
df = pd.read_csv("data/wisdm/WISDM_ar_v1.1_transformed.csv")
df.head(10)

Figure 11.1: First rows of WISDM dataset.

Here, you can see that the rows are sorted by user. You can easily shuffle the rows with the sample() function as follows.

# Shuffle all rows reproducibly and reset the row index.
shuffled_df = df.sample(frac=1, random_state=1234).reset_index(drop=True)
shuffled_df.head(10)

Figure 11.2: First rows of WISDM dataset after shuffling.

Figure 11.2 shows the result of shuffling the rows. The frac parameter indicates the fraction of rows to sample without replacement; setting it to \(1\) returns all the rows in random order. Note that when splitting the data with train_test_split(), the rows are also shuffled by default (shuffle=True).
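An equivalent shuffled split can be sketched by hand with a random permutation of the row indices, which mirrors what a shuffled train/test split does internally (the toy DataFrame and the 80/20 ratio are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data sorted by user, like the WISDM rows before shuffling.
df = pd.DataFrame({"x": range(10), "user": ["A"] * 5 + ["B"] * 5})

# Shuffle the row indices, then cut them 80/20 into train and test.
rng = np.random.default_rng(1234)
idx = rng.permutation(len(df))
n_train = int(len(df) * 0.8)
train = df.iloc[idx[:n_train]]
test = df.iloc[idx[n_train:]]
print(len(train), len(test))  # 8 2
```

Because the indices are permuted before the cut, both sets mix rows from across the original ordering instead of taking a contiguous (and possibly biased) block.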


As a good practice, the train set should be shuffled before fitting models that assume independence between instances.