Mistake 2: Reporting train performance
The ultimate goal of computing performance metrics is to estimate the generalization performance of your model, that is, how it will behave when fed new data. This is the main reason for dividing the dataset into train and test sets: you train a model on the train set and evaluate its performance on the test set. The test set performance is what you report most of the time. However, it is easy to mix up the two sets when coding your program. Take, for example, the following code snippet, in which a decision tree is fitted to the Wine dataset. Before fitting, the data is split into train and test sets.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
# Load the wine dataset.
data = load_wine()
# Divide into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                    data.target,
                                                    test_size=0.5,
                                                    random_state=123)
tree = DecisionTreeClassifier(random_state=123)
tree.fit(X_train, y_train)
predictions = tree.predict(X_train)
print(f"Decision Tree accuracy: {accuracy_score(y_train, predictions):.3f}")
What is wrong with the previous code? Actually, nothing: it runs without errors and prints the accuracy (\(1.0\)) as expected. However, something is suspicious here. The accuracy is perfect. When your results look too good, it may be an indication that something is going on. If you take a closer look at the last two lines, you will notice that the performance (accuracy) is being computed on the train set. If your aim is to estimate the generalization performance of your model, you should instead compute the performance on the test set. Even though you already know this, it is common to make this type of error when copying and pasting code. If what you are really interested in is the generalization performance, the last two lines should be changed to:
predictions = tree.predict(X_test)
print(f"Decision Tree accuracy: {accuracy_score(y_test, predictions):.3f}")
In this case the accuracy is much lower (\(0.865\)), but it represents a better estimate of what you would expect when you deploy your system into production.
This does not mean that you should never compute the training performance. In fact, it is very useful to have this information at hand, for example, when diagnosing overfitting: a train accuracy that is much higher than the test accuracy is a typical symptom.
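As a sketch of that diagnostic, the snippet below (reusing the same Wine data, split, and tree from above) reports both scores side by side so the train/test gap becomes visible:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# Same setup as before: load the Wine dataset and split it in half.
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                    data.target,
                                                    test_size=0.5,
                                                    random_state=123)
tree = DecisionTreeClassifier(random_state=123)
tree.fit(X_train, y_train)

# Compute both accuracies; a large gap between them hints at overfitting.
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap (train - test): {train_acc - test_acc:.3f}")
```

Here the train accuracy is perfect while the test accuracy is not, which is expected for an unpruned decision tree: it can memorize the training data. The gap itself, not the train score alone, is the useful diagnostic.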