In this example, we will use the Iris data as used in the previous chapter. The features and the target are loaded into arrays named X and y, respectively.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
# Loading the iris data
iris = datasets.load_iris()
X = iris.data # array for the features
y = iris.target # array for the target
feature_names = iris.feature_names # feature names
target_names = iris.target_names # target names
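To get a feel for what has just been loaded, you can inspect the array shapes and the stored names. This is an optional check, not part of the original listing:
print(X.shape)        # (150, 4): 150 observations, 4 features
print(y.shape)        # (150,): one target label per observation
print(feature_names)  # names of the 4 features
print(target_names)   # names of the 3 target classes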
Here, we split the Iris data into the training data set (comprising 2/3 of all observations) and the testing data set (comprising the remaining 1/3 of all observations). This is done by the train_test_split function available in the sklearn.model_selection library. This function takes a feature array and a target array as input parameters, and splits them into training and testing data sets. In this example, the input features and targets, X and y respectively, are split into the training set (X_train for the features, and y_train for the target) as well as the testing set (likewise X_test and y_test).
# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                                    random_state=2020)
Here, the parameter test_size specifies the proportion of the original data to be assigned to the testing data. The order of observations is randomized in both the training and testing data.
print(y)        # original target labels
print(y_train)  # training data target labels
print(y_test)   # testing data target labels
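You can also verify the split sizes directly. In this minimal check, the expected counts assume the 150-observation Iris data with test_size=0.333 (scikit-learn rounds the testing set size up, giving 50 testing and 100 training observations):
print(X_train.shape, y_train.shape)  # expected: (100, 4) (100,)
print(X_test.shape, y_test.shape)    # expected: (50, 4) (50,)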
The parameter random_state is used to seed (or initialize) the random number generator. It is not a required parameter, but if you want to re-create the same random split, you can use the same value for random_state.
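To illustrate this point with a small sketch (the variable names X_tr2 and so on are hypothetical, introduced here only for the comparison), calling train_test_split again with the same random_state reproduces exactly the same split:
# re-creating the same split by reusing the seed
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.333,
                                              random_state=2020)
print(np.array_equal(y_test, y_te2))  # True: identical testing labels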
In this example, we train a naive Bayes classifier for the Iris data and evaluate its performance. We use a Gaussian naive Bayes classifier object GaussianNB available in the sklearn.naive_bayes library. We define the classifier object as gnb. The classifier object is trained with the training data (both the features X_train and the target y_train) using the fit method. Once the classifier gnb is trained, it is used to predict target labels based on the testing data features X_test. The predicted labels are stored in y_pred.
# Gaussian naive Bayes classifier
gnb = GaussianNB() # defining the classifier object
gnb.fit(X_train, y_train) # training the classifier with the training set
y_pred = gnb.predict(X_test) # generating prediction with trained classifier
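Besides the hard label predictions returned by predict, the GaussianNB object also provides a predict_proba method, which returns the posterior probability of each class for each observation. The following is a brief sketch of its use, not part of the original example:
# posterior class probabilities for the first few testing observations
probs = gnb.predict_proba(X_test[:5])
print(np.round(probs, 3))  # one row per observation, one column per class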
Now we examine the performance of the classifier by generating a confusion matrix. This is done by the confusion_matrix function available in the sklearn.metrics library. Here, we need to provide the true target labels y_test as well as the predicted labels y_pred.
# confusion matrix
print(confusion_matrix(y_test, y_pred))
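In the printed matrix, the rows correspond to the true classes and the columns to the predicted classes, in the order given by target_names. If the bare array is hard to read, one option (a sketch assuming the pandas library is available) is to attach the class names as labels:
import pandas as pd  # assuming pandas is available
cm = confusion_matrix(y_test, y_pred)
# rows: true classes; columns: predicted classes
print(pd.DataFrame(cm, index=target_names, columns=target_names))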
As you can see, there are some misclassifications in the 2nd and 3rd classes (versicolor and virginica). We can also generate other measures of model performance with the classification_report function in the sklearn.metrics library. This function also takes the arrays of true and predicted target labels. We also provide the list of target class names stored in target_names to the parameter target_names, so that the output table has row headings corresponding to the different target classes.
# classification report
print(classification_report(y_test, y_pred, target_names=target_names))
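If a single summary number is all you need, the overall accuracy on the testing data can also be obtained directly from the classifier with the score method, part of the standard scikit-learn classifier interface (a minimal sketch, not in the original example):
# overall accuracy on the testing data
print(gnb.score(X_test, y_test))  # fraction of correctly classified observations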