Errors and Validation


Training and testing data


In this example, we use the Iris data set from the previous chapter. The features and the target are loaded into arrays named X and y, respectively.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Loading the iris data
iris = datasets.load_iris()
X = iris.data  # array for the features
y = iris.target  # array for the target
feature_names = iris.feature_names   # feature names
target_names = iris.target_names   # target names

Here, we split the Iris data into a training data set (comprising 2/3 of all observations) and a testing data set (comprising the remaining 1/3). This is done with the train_test_split function available in the sklearn.model_selection module. This function takes a feature array and a target array as input parameters, and splits them into training and testing data sets. In this example, the input features and targets, X and y respectively, are split into the training set (X_train for the features, and y_train for the target) and the testing set (likewise X_test and y_test).

# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                                    random_state=2020)

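Here, the parameter test_size specifies the proportion of the original data to be assigned to the testing data. With 150 observations, test_size=0.333 assigns 50 observations to the testing data and the remaining 100 to the training data. The order of observations is randomized in both the training and testing data, as the target labels below show.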

y # original target labels
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
y_train  # training data target labels
array([0, 2, 1, 0, 1, 1, 2, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 2, 0,
       0, 2, 0, 2, 2, 0, 2, 0, 0, 1, 0, 0, 2, 1, 0, 2, 1, 2, 0, 2, 2, 0,
       1, 2, 0, 2, 1, 1, 2, 1, 0, 2, 1, 0, 1, 1, 1, 2, 1, 0, 2, 0, 0, 1,
       2, 2, 2, 1, 2, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2, 2,
       1, 2, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1])
y_test  # testing data target labels
array([2, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
       1, 0, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0, 0, 1, 0, 2, 0, 2, 1, 0,
       1, 2, 2, 1, 1, 1])
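
A quick way to check the result of the split is to look at the array shapes and at how many observations of each class ended up in each set. The following is a minimal sketch that repeats the split above so it can be run on its own. The variable names train_counts and test_counts are introduced here for illustration, and the exact class counts may depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same data and split as above
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                                    random_state=2020)

# Number of observations and features in each set
print(X_train.shape, X_test.shape)  # (100, 4) (50, 4)

# Number of observations per class (0, 1, 2) in each set
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print(train_counts, test_counts)
```

Because the split is random, the three classes are not necessarily represented equally in the two sets.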

The parameter random_state is used to seed (or initialize) the random number generator. It is not a required parameter, but if you want to re-create the same random split, you can pass the same value for random_state.
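
The effect of random_state can be checked directly. The sketch below, offered as an illustration rather than a required step, performs the same split twice with the same seed and compares the resulting arrays. The names split_a, split_b and same are introduced here and are not used elsewhere in this chapter.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.333, random_state=2020)
split_b = train_test_split(X, y, test_size=0.333, random_state=2020)

# Each call returns [X_train, X_test, y_train, y_test];
# compare the corresponding arrays pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```

If random_state is omitted, each call draws a different random split, so results such as the confusion matrix can change from run to run.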

Classifier training and prediction


In this example, we train a naive Bayes classifier on the Iris data and evaluate its performance. We use the Gaussian naive Bayes classifier GaussianNB available in the sklearn.naive_bayes module. We define the classifier object as gnb. The classifier object is trained with the training data (both features X_train and target y_train) using the fit method. Once the classifier gnb is trained, its predict method is used to predict target labels from the testing data features X_test. The predicted labels are stored in y_pred.

# Gaussian naive Bayes classifier 
gnb = GaussianNB()  # defining the classifier object
gnb.fit(X_train, y_train)  # training the classifier with the training set
y_pred = gnb.predict(X_test)  # generating prediction with trained classifier

Now we examine the performance of the classifier by generating a confusion matrix. This is done with the confusion_matrix function available in the sklearn.metrics module. Here, we need to provide the true target labels y_test as well as the predicted labels y_pred.

# confusion matrix
print(confusion_matrix(y_test, y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  3 13]]
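
In the matrix returned by confusion_matrix, each row corresponds to a true class and each column to a predicted class, so correct predictions lie on the diagonal. As a small worked example, the overall accuracy can be computed from the matrix alone. The sketch below copies the values printed above into an array named cm (a name introduced here) rather than recomputing them.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true classes, columns: predicted classes)
cm = np.array([[18,  0,  0],
               [ 0, 14,  2],
               [ 0,  3, 13]])

n_correct = np.trace(cm)   # sum of the diagonal = correct predictions
n_total = cm.sum()         # all testing observations
accuracy = n_correct / n_total
print(n_correct, n_total, accuracy)  # 45 50 0.9
```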

As you can see, there are some misclassifications between the 2nd and 3rd classes (versicolor and virginica): 2 versicolor observations are predicted as virginica, and 3 virginica observations are predicted as versicolor. We can also generate other measures of model performance with the classification_report function in the sklearn.metrics module. This function also takes the true and the predicted target labels. We also pass the array of target class names stored in target_names to the parameter target_names so that the rows of the output table are labeled with the class names.

# classification report
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.82      0.88      0.85        16
   virginica       0.87      0.81      0.84        16

    accuracy                           0.90        50
   macro avg       0.90      0.90      0.90        50
weighted avg       0.90      0.90      0.90        50
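
The per-class values in this report can be derived from the confusion matrix shown earlier. For each class, precision is the diagonal entry divided by its column sum (all observations predicted as that class), recall is the diagonal entry divided by its row sum (all observations truly in that class), and the f1-score is the harmonic mean of the two. The following is a minimal sketch with NumPy, using the matrix values copied from the output above. The names cm, class_names, precision, recall and f1 are introduced here for illustration.

```python
import numpy as np

# Confusion matrix copied from the earlier output
# (rows: true classes, columns: predicted classes)
cm = np.array([[18,  0,  0],
               [ 0, 14,  2],
               [ 0,  3, 13]])
class_names = ["setosa", "versicolor", "virginica"]

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column sums
recall = true_positives / cm.sum(axis=1)     # divide by row sums
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:>12s}  {p:.2f}  {r:.2f}  {f:.2f}")
```

For example, the precision for versicolor is 14 / (14 + 3), which is about 0.82, and its recall is 14 / (14 + 2), which is about 0.88, in agreement with the report.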