A nearest-neighbor classifier based on the Euclidean distance is implemented in the
R package class. To show how to use the nearest-neighbor classifier in R, we
split the Iris data set into a training set iris.training and a test set
iris.test, as demonstrated in Sect. 5.6.2. The function knn requires a
training set and a test set with only numerical attributes, along with a vector
containing the classifications for the training set. The parameter k
determines how many nearest neighbors are considered for the classification decision.
# shuffling the observations
n <- nrow(iris)
iris.shuffled <- iris[sample(n),]
# splitting into training and test data
prop.train <- 2/3 # training data consists of 2/3 of observations
n.train <- round(prop.train*n)
iris.training <- iris.shuffled[1:n.train,]
iris.test <- iris.shuffled[(n.train+1):n,]
library(class)
iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],iris.training[,5],k=3)
table(iris.knn,iris.test[,5])
The last line prints the confusion matrix.
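The accuracy can be read off the confusion matrix directly: correct predictions lie on its diagonal. A minimal base-R sketch with made-up labels (the variables pred and true are only illustrative):

```r
# Toy confusion matrix: rows are predicted classes, columns are true classes.
pred <- factor(c("a", "a", "b", "b", "b"))
true <- factor(c("a", "b", "b", "b", "a"))
conf <- table(pred, true)
# Accuracy is the share of the diagonal entries in the total count.
accuracy <- sum(diag(conf)) / sum(conf)
```

The same two lines applied to table(iris.knn, iris.test[,5]) yield the accuracy of the nearest-neighbor classifier on the test set.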
For the example of multilayer perceptrons in R, we use the same training and test
data as for the nearest-neighbor classifier above. A multilayer perceptron can only
process numerical values. Therefore, we first have to transform the categorical
attribute Species into a numerical attribute:
x <- iris.training
x$Species <- as.numeric(x$Species)
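Note that as.numeric applied to a factor returns the internal level codes rather than the label text; by default, the levels are sorted alphabetically. A small base-R illustration:

```r
# A factor with three species names; the levels are sorted alphabetically,
# so as.numeric yields the level codes 1, 2, 3, 1.
sp <- factor(c("setosa", "versicolor", "virginica", "setosa"))
as.numeric(sp)
```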
The multilayer perceptron is constructed and trained in the following way, where
the library neuralnet
needs to be installed first:
library(neuralnet)
iris.nn <- neuralnet(Species + Sepal.Length ~
Sepal.Width + Petal.Length + Petal.Width, x,
hidden=c(3))
The first argument of neuralnet defines that the attributes Species and
Sepal.Length correspond to the output neurons. The other three attributes
correspond to the input neurons. x specifies the training data set. The parameter
hidden defines how many hidden layers the multilayer perceptron should have and
how many neurons each hidden layer should contain. In the above example, there is
only one hidden layer with three neurons. If we replaced c(3) by c(4,2), there
would be two hidden layers, one with four and one with two neurons.
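The two-hidden-layer variant could be sketched as follows; this is only a sketch, assuming neuralnet is installed and the numerical data x has been prepared as above (here rebuilt from the full Iris data for self-containment):

```r
library(neuralnet)
# Rebuild the numerical data as above (full Iris data for brevity).
x <- iris
x$Species <- as.numeric(x$Species)
# Two hidden layers: four neurons in the first, two in the second.
iris.nn2 <- neuralnet(Species + Sepal.Length ~
                      Sepal.Width + Petal.Length + Petal.Width, x,
                      hidden = c(4, 2))
```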
The training of the multilayer perceptron can take some time, especially for larger data sets.
When the training is finished, the multilayer perceptron can be visualized:
plot(iris.nn)
The visualization also includes dummy neurons, as shown in Fig. 9.4.
The output of the multilayer perceptron for the test set can be calculated in the following way. Note that we first have to remove the output attributes from the test set:
y <- iris.test
y <- y[-5] # remove Species
y <- y[-1] # remove Sepal.Length
y.out <- compute(iris.nn,y)
We can then compare the target outputs for the test set with the outputs computed
by the multilayer perceptron. If we want to compute the squared errors for the
second output neuron, Sepal.Length, we can do this in the following way:
y.sqerr <- (iris.test$Sepal.Length - y.out$net.result[,2])^2
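Such per-observation squared errors are usually condensed into a single figure, the mean squared error, or its square root. A small base-R sketch with made-up numbers:

```r
# Toy targets and predictions for a numerical output attribute.
target <- c(5.1, 4.9, 6.0)
predicted <- c(5.0, 5.1, 5.8)
sqerr <- (target - predicted)^2
mse <- mean(sqerr)  # mean squared error
rmse <- sqrt(mse)   # root mean squared error, in the units of the attribute
```

Applied to y.sqerr above, mean(y.sqerr) summarizes the network's error on the sepal length over the whole test set.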
For support vector machines, we use the same training and test data as for the
nearest-neighbor classifier and the neural network. A support vector machine to
predict the species in the Iris data set based on the other attributes can be
constructed in the following way. The package e1071 is needed and should be
installed first if it has not been installed before:
library(e1071)
iris.svm <- svm(Species ~ ., data = iris.training)
table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
The last line prints the confusion matrix for the test data set.
The function svm also works for support vector regression. We could, for instance,
use
iris.svm <- svm(Petal.Width ~ ., data = iris.training)
sqerr <- (predict(iris.svm,iris.test[-4])-iris.test[4])^2
to predict the numerical attribute Petal.Width based on the other attributes and to
compute the squared errors for the test set.
As an example of ensemble methods, we consider random forests with the training
and test data of the Iris data set as before. The package randomForest needs to
be installed first:
library(randomForest)
iris.rf <- randomForest(Species ~ ., data = iris.training)
table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
In this way, a random forest is constructed to predict the species
in the Iris data set
based on the other attributes. The last line of the code prints the confusion matrix
for the test data set.
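Beyond the confusion matrix, the randomForest package also reports how much each attribute contributes to the prediction via the function importance. A sketch, assuming the package is installed (trained here on the full Iris data for self-containment):

```r
library(randomForest)
iris.rf <- randomForest(Species ~ ., data = iris)
# One row per predictor attribute; a higher mean decrease in Gini
# impurity indicates a more important attribute.
importance(iris.rf)
```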