Mistake 8: Not comparing against a baseline
Sometimes you may be tempted to solve a problem with a complex machine learning model. You train and evaluate the complex model, obtain a high performance estimate, and decide to deploy the model to production. Later, you realize that the expected benefits (increased sales, waste reduction, higher click rates, etc.) are not materializing. What could have gone wrong? It is a common mistake to assume that a good performance estimate (accuracy, recall, etc.) means that the model will also solve the actual problem effectively. However, models learn from the data, not from the problem itself. If the data does not characterize the problem well, the model is unlikely to be useful for solving that problem, even if the performance metrics look great. Underlying problems with the data can cause a model to produce seemingly good predictions that are not useful at all. Examples of such problems include class imbalance and features that carry no information relevant to the problem.
One way to spot such problems is to compare your complex models against baseline models. If a simple baseline model performs similarly to a complex one, it may be an indication that the latter is not learning patterns from the data. It can also happen that a problem is so difficult (e.g., forecasting) that performing slightly better than random is a huge achievement.
A baseline model does not necessarily need to be machine-learning-based. It could be something as simple as predicting the mean (for regression), or applying a manually set threshold to one of the features to decide the label. Baseline models are also called dummy models. Some dummy models for classification include:
- Most-frequent-class. This model always predicts the most frequent class, so it does not even need to analyze the features.
- Uniform predictions. This model predicts classes at random with equal probability.
- Prior. This model predicts a class based on its prior probability.
For regression, a baseline model can predict the mean value. In the case of timeseries data, the prediction at time \(t_i\) can be the value at \(t_{i-1}\), or the average across the previous \(n\) values.
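These regression and time-series baselines can be sketched in a few lines of NumPy. The data values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical regression targets (illustrative values only)
y_train = np.array([3.0, 5.0, 4.0, 6.0, 2.0])
y_test = np.array([4.0, 5.0])

# Mean baseline: predict the training mean for every test point
mean_pred = np.full_like(y_test, y_train.mean())

# Time-series lag-1 baseline: the prediction at t_i is the value at t_{i-1}
series = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
lag1_pred = series[:-1]  # predictions for series[1:]

# Rolling-mean baseline: average of the previous n values
n = 2
rolling_pred = np.convolve(series, np.ones(n) / n, mode="valid")[:-1]
```

Any real model should beat these trivial predictors by a clear margin before it is worth deploying.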
For the following example, a synthetic dataset with two classes is created. It consists of \(1000\) rows and \(10\) features. The dataset is forced to be imbalanced with weights=[0.95, 0.05].
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_redundant=0, n_classes=2,
                           weights=[0.95, 0.05], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
Next, a neural network (a Multi-Layer Perceptron, MLP) is trained on the synthetic data.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

nn = MLPClassifier(hidden_layer_sizes=(5, 2), max_iter=1000, random_state=98)
nn.fit(X_train, y_train)
y_pred = nn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Neural Network accuracy: {accuracy:.2f}")
The final accuracy is \(0.95\), which looks great. However, you are not convinced by the results, so you decide to train a baseline model. scikit-learn provides a DummyClassifier and a DummyRegressor. The following code fits a most-frequent DummyClassifier. The strategy argument specifies the type of dummy model: “most_frequent”, “prior”, “stratified”, “uniform”, or “constant”.
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent', random_state=123)
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Dummy classifier (most frequent) accuracy: {accuracy:.2f}")
To our surprise, the most-frequent model performed the same as the neural network. This means that, even though the accuracy seems high, the neural network provides no real benefit. In this case, the reason is that the classes are imbalanced.
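Class-sensitive metrics make this failure mode visible directly. As a sketch (recreating the dataset above with an assumed 70/30 train/test split, since the exact split used earlier is not shown), the recall on the minority class exposes that a most-frequent model never detects it:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, n_features=10,
                           n_redundant=0, n_classes=2,
                           weights=[0.95, 0.05], random_state=123)
# Assumed 70/30 split for illustration
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
# The dummy always predicts the majority class (0), so it never
# recovers a single minority example: recall for class 1 is 0
minority_recall = recall_score(y_test, dummy.predict(X_test), pos_label=1)
```

A model with \(0.95\) accuracy but zero minority-class recall is useless whenever the minority class is the one you actually care about.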
The accuracy of the uniform model was \(0.48\). If our neural network had achieved an accuracy of \(\approx 0.48\) (which was not the case), it would have been an indication that the model was not learning.
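The uniform baseline can be fit the same way as the most-frequent one. Below is a sketch (again assuming a 70/30 split); the exact accuracy depends on the split and the random seed, but it lands near \(0.5\) regardless of the class imbalance, because the model flips a fair coin for every prediction:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10,
                           n_redundant=0, n_classes=2,
                           weights=[0.95, 0.05], random_state=123)
# Assumed 70/30 split for illustration
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Uniform strategy: predicts each class with equal probability
uniform = DummyClassifier(strategy='uniform', random_state=123)
uniform.fit(X_train, y_train)
acc = accuracy_score(y_test, uniform.predict(X_test))
```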
In short, use DummyClassifier() and DummyRegressor() to create simple baselines to compare your models against.