Mistake 3: Not setting a seed value

Not setting a seed value when using functions that rely on random number generators can make it difficult to reproduce the results.

When working in machine learning, most of the time you will be dealing with non-deterministic functions. Non-deterministic means that some randomness is involved in the process, and this typically happens when making a call (directly or indirectly) to a random number generator. For example, when splitting the data into train and test sets, you can use the train_test_split() function from scikit-learn. The following code snippet loads the wine dataset and splits it into a train and a test set. Then, it prints the first ten classes in the train set.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

# Load the wine dataset.
data = load_wine()

# Divide into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                    data.target,
                                                    test_size=0.5)

# Print the first classes in the train set.
print(y_train[0:10])
#>> [1 0 1 1 0 2 1 1 1 2]

If you run the previous code multiple times, you will get different results every time. In fact, this is the behavior you expect, since the data should be split randomly. At this point you may be wondering: Where is the error? Why is this considered a mistake? The reason I categorize this as a mistake is that in some circumstances, including when you publish your results or share your code with someone else, it will be difficult to get the same result again and again. But isn't this contradictory? On one hand, you want to randomly split the data, but on the other hand, you want to get the same results every time. The answer is that since random number generators are not truly random but pseudo-random, you can achieve both objectives: you can simulate a random process and at the same time make it reproducible. To do so, you set a fixed seed value for the random number generator. Then, every time you call a function based on that generator, it will produce the same result for the same initial seed value. In scikit-learn, you can set the seed value for many of its functions through the random_state parameter.
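To see why a fixed seed makes a pseudo-random process reproducible, here is a minimal sketch using NumPy's own generator (NumPy is not part of the original example, but scikit-learn relies on it internally): two generators initialized with the same seed produce exactly the same sequence of numbers.

```python
import numpy as np

# Two independent generators initialized with the same seed.
rng_a = np.random.default_rng(seed=123)
rng_b = np.random.default_rng(seed=123)

# Each draws five "random" integers between 0 and 9.
draws_a = rng_a.integers(0, 10, size=5)
draws_b = rng_b.integers(0, 10, size=5)

# Both sequences are identical because the seed is the same.
print(draws_a)
print(draws_b)
```

The sequences look random, but they are fully determined by the seed, which is exactly what makes reproducibility possible.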

The following code passes the random_state=123 parameter to the train_test_split() function. This parameter accepts an integer value.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

# Load the wine dataset.
data = load_wine()

# Divide into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                    data.target,
                                                    test_size=0.5,
                                                    random_state=123)

# Print the first classes in the train set.
print(y_train[0:10])
#>> [2 2 2 1 0 0 1 0 1 1]

Now, every time you run the previous code, it will generate the same result based on the value of the random_state parameter. If you change its value, a different train/test split will be generated, but you are guaranteed to get the same result whenever you use the same integer value.
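As a quick sanity check (not part of the original example), you can perform the split twice with the same random_state and verify that both splits are identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

data = load_wine()

# Split the data twice using the same seed value.
_, _, y_train_1, _ = train_test_split(data.data, data.target,
                                      test_size=0.5, random_state=123)
_, _, y_train_2, _ = train_test_split(data.data, data.target,
                                      test_size=0.5, random_state=123)

# Both splits contain exactly the same train labels, in the same order.
print(np.array_equal(y_train_1, y_train_2))
#>> True
```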

If you are training a model that relies on random number generators, you should set the random_state parameter as well. For example, the RandomForestClassifier picks random features to fit each of its trees. The following code creates a random forest model and sets its random state.

from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier and set its random state.
rf = RandomForestClassifier(n_estimators=50, random_state=123)

Be aware that sometimes, even when setting the random_state parameter, you may get different results. This can happen if you run the same code with different library versions. It has been reported that some scikit-learn functions produce different results across versions (Guigui14460 2021). I have also experienced some inconsistencies with the RandomForestClassifier when running it with different versions of scikit-learn. Apart from setting the random seed, you should also document which versions of Python and the libraries were used in your program.
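One simple way to document this (a minimal sketch; how you record the information is up to you, e.g. a log file or a README) is to print the relevant versions at the start of your script:

```python
import sys
import numpy
import sklearn

# Record the exact versions used, so results can be reproduced later.
print(f"Python: {sys.version}")
print(f"NumPy: {numpy.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
```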

In scikit-learn you can set the random_state parameter in several functions to make sure you always get the same results.

References

Guigui14460. 2021. Differences in Scores Between Two Different Versions. https://github.com/scikit-learn/scikit-learn/issues/20042.